You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@hbase.apache.org by "gaojinchao (Created) (JIRA)" <ji...@apache.org> on 2011/09/29 10:10:45 UTC

[jira] [Created] (HBASE-4511) There is data loss when master failovers

There is data loss when master failovers
----------------------------------------

                 Key: HBASE-4511
                 URL: https://issues.apache.org/jira/browse/HBASE-4511
             Project: HBase
          Issue Type: Bug
          Components: master
    Affects Versions: 0.92.0
            Reporter: gaojinchao
            Priority: Critical
             Fix For: 0.92.0


It goes like this:

Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
Master startups again and finds all living RS. 
Master verifies the meta failed,  because this RS is crashing.
Master reassigns the meta, but it doesn't split the Hlog. 

So some meta data is loss.

About the logs of a failover test case fail. 

//It said that we want to kill a RS

2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 

RS 192.168.2.102,54385,1317264874629 killed 

//Rs didn't crash. 
2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...

//Meta verification failed and ressigned the meta. So all the regions in the meta is loss.

2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192


//It said that Master clean the cluster.
2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4511) There is data loss when master failovers

Posted by "gaojinchao (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144982#comment-13144982 ] 

gaojinchao commented on HBASE-4511:
-----------------------------------

+1
                
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Assignee: stack
>            Priority: Minor
>             Fix For: 0.92.0
>
>         Attachments: 4511-v2.txt, 4511.txt, org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt
>
>
> It goes like this:
> Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
> Master startups again and finds all living RS. 
> Master verifies the meta failed,  because this RS is crashing.
> Master reassigns the meta, but it doesn't split the Hlog. 
> So some meta data is loss.
> About the logs of a failover test case fail. 
> //It said that we want to kill a RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //Rs didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //Meta verification failed and ressigned the meta. So all the regions in the meta is loss.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It said that Master clean the cluster.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4511) There is data loss when master failovers

Posted by "Ted Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144465#comment-13144465 ] 

Ted Yu commented on HBASE-4511:
-------------------------------

The proposed approach makes sense.

{code}
+  private boolean expireIfOnline(final ServerName sn) {
+    if (sn != null && !this.serverManager.isServerOnline(sn)) return false;
{code}
I think checking sn against null should be on its own. Otherwise we would get NPE at the first line of expireServer():
{code}
  public synchronized void expireServer(final ServerName serverName) {
    if (!this.onlineServers.containsKey(serverName)) {
{code}
See http://download.oracle.com/javase/1,5,0/docs/api/java/util/concurrent/ConcurrentHashMap.html:

Like Hashtable but unlike HashMap, this class does NOT allow null to be used as a key or value.
                
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Priority: Minor
>             Fix For: 0.92.0
>
>         Attachments: org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt
>
>
> It goes like this:
> Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
> Master startups again and finds all living RS. 
> Master verifies the meta failed,  because this RS is crashing.
> Master reassigns the meta, but it doesn't split the Hlog. 
> So some meta data is loss.
> About the logs of a failover test case fail. 
> //It said that we want to kill a RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //Rs didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //Meta verification failed and ressigned the meta. So all the regions in the meta is loss.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It said that Master clean the cluster.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4511) There is data loss when master failovers

Posted by "ramkrishna.s.vasudevan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123377#comment-13123377 ] 

ramkrishna.s.vasudevan commented on HBASE-4511:
-----------------------------------------------

@Gao
This problem occured in testcase.  Can we reproduce this in real time? It would be great if we can reproduce so that we are clear of the actual problem?

                
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Priority: Critical
>             Fix For: 0.92.0
>
>         Attachments: org.apache.hadoop.hbase.master.TestMasterFailover-output.rar
>
>
> It goes like this:
> Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
> Master startups again and finds all living RS. 
> Master verifies the meta failed,  because this RS is crashing.
> Master reassigns the meta, but it doesn't split the Hlog. 
> So some meta data is loss.
> About the logs of a failover test case fail. 
> //It said that we want to kill a RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //Rs didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //Meta verification failed and ressigned the meta. So all the regions in the meta is loss.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It said that Master clean the cluster.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4511) There is data loss when master failovers

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13148949#comment-13148949 ] 

Hudson commented on HBASE-4511:
-------------------------------

Integrated in HBase-0.92 #125 (See [https://builds.apache.org/job/HBase-0.92/125/])
    HBASE-4511 There is data loss when master failovers

stack : 
Files : 
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/catalog/MetaReader.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/HMaster.java

                
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Assignee: stack
>            Priority: Minor
>             Fix For: 0.92.0
>
>         Attachments: 4511-v2.txt, 4511-v3.txt, 4511.txt, org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt
>
>
> It goes like this:
> Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
> Master startups again and finds all living RS. 
> Master verifies the meta failed,  because this RS is crashing.
> Master reassigns the meta, but it doesn't split the Hlog. 
> So some meta data is loss.
> About the logs of a failover test case fail. 
> //It said that we want to kill a RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //Rs didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //Meta verification failed and ressigned the meta. So all the regions in the meta is loss.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It said that Master clean the cluster.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-4511) There is data loss when master failovers

Posted by "Ted Yu (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-4511:
--------------------------

    Fix Version/s:     (was: 0.94.0)
                   0.92.0
    
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Priority: Minor
>             Fix For: 0.92.0
>
>         Attachments: org.apache.hadoop.hbase.master.TestMasterFailover-output.rar
>
>
> It goes like this:
> Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
> Master startups again and finds all living RS. 
> Master verifies the meta failed,  because this RS is crashing.
> Master reassigns the meta, but it doesn't split the Hlog. 
> So some meta data is loss.
> About the logs of a failover test case fail. 
> //It said that we want to kill a RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //Rs didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //Meta verification failed and ressigned the meta. So all the regions in the meta is loss.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It said that Master clean the cluster.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-4511) There is data loss when master failovers

Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-4511:
-------------------------

    Attachment: 4511-v2.txt

Thank you for the log Jinchao.  That helped.  So, yes, metaLocation can be null at this point.  It won't be set until its been 'verified'.  So what we want if metaLocation is null is what is up in -ROOT- (which is what I believe you were saying above in your first feedback).

This patch adds a method to CatalogTracker that will return metaLocation if non-null else whatever the current content of -ROOT-.
                
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Assignee: stack
>            Priority: Minor
>             Fix For: 0.92.0
>
>         Attachments: 4511-v2.txt, 4511.txt, org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt
>
>
> It goes like this:
> Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
> Master startups again and finds all living RS. 
> Master verifies the meta failed,  because this RS is crashing.
> Master reassigns the meta, but it doesn't split the Hlog. 
> So some meta data is loss.
> About the logs of a failover test case fail. 
> //It said that we want to kill a RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //Rs didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //Meta verification failed and ressigned the meta. So all the regions in the meta is loss.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It said that Master clean the cluster.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4511) There is data loss when master failovers

Posted by "gaojinchao (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144523#comment-13144523 ] 

gaojinchao commented on HBASE-4511:
-----------------------------------

Meta table is different from Root table. When Meta RS is dying, we should not use getMetaLocation.
I think we should get the sn from root table. 
                
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Priority: Minor
>             Fix For: 0.92.0
>
>         Attachments: org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt
>
>
> It goes like this:
> Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
> Master startups again and finds all living RS. 
> Master verifies the meta failed,  because this RS is crashing.
> Master reassigns the meta, but it doesn't split the Hlog. 
> So some meta data is loss.
> About the logs of a failover test case fail. 
> //It said that we want to kill a RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //Rs didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //Meta verification failed and ressigned the meta. So all the regions in the meta is loss.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It said that Master clean the cluster.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4511) There is data loss when master failovers

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144587#comment-13144587 ] 

stack commented on HBASE-4511:
------------------------------

@Gaojinchao I don't understand your comment.  When I call getMetaLocation on line 558, am I not using same address that failed above when we called verifyMetaRegionLocation on the line above at line 557?  Thanks.
                
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Priority: Minor
>             Fix For: 0.92.0
>
>         Attachments: 4511.txt, org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt
>
>
> It goes like this:
> Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
> Master startups again and finds all living RS. 
> Master verifies the meta failed,  because this RS is crashing.
> Master reassigns the meta, but it doesn't split the Hlog. 
> So some meta data is loss.
> About the logs of a failover test case fail. 
> //It said that we want to kill a RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //Rs didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //Meta verification failed and ressigned the meta. So all the regions in the meta is loss.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It said that Master clean the cluster.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-4511) There is data loss when master failovers

Posted by "Ted Yu (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-4511:
--------------------------

    Fix Version/s:     (was: 0.92.0)
                   0.94.0
    
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Priority: Minor
>             Fix For: 0.94.0
>
>         Attachments: org.apache.hadoop.hbase.master.TestMasterFailover-output.rar
>
>
> It goes like this:
> Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
> Master startups again and finds all living RS. 
> Master verifies the meta failed,  because this RS is crashing.
> Master reassigns the meta, but it doesn't split the Hlog. 
> So some meta data is loss.
> About the logs of a failover test case fail. 
> //It said that we want to kill a RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //Rs didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //Meta verification failed and ressigned the meta. So all the regions in the meta is loss.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It said that Master clean the cluster.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4511) There is data loss when master failovers

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13145242#comment-13145242 ] 

stack commented on HBASE-4511:
------------------------------

I like your suggestion @Ram.  New patch to follow.
                
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Assignee: stack
>            Priority: Minor
>             Fix For: 0.92.0
>
>         Attachments: 4511-v2.txt, 4511.txt, org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt
>
>
> It goes like this:
> Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
> Master startups again and finds all living RS. 
> Master verifies the meta failed,  because this RS is crashing.
> Master reassigns the meta, but it doesn't split the Hlog. 
> So some meta data is loss.
> About the logs of a failover test case fail. 
> //It said that we want to kill a RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //Rs didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //Meta verification failed and ressigned the meta. So all the regions in the meta is loss.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It said that Master clean the cluster.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4511) There is data loss when master failovers

Posted by "gaojinchao (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123372#comment-13123372 ] 

gaojinchao commented on HBASE-4511:
-----------------------------------

@RAM 
For this case. We can process it:
1. why the Region server can't exit ?
2. If master verifies the meta/root failed. Does master need crash? or wait for ServerShutDownHandler.

                
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Priority: Critical
>             Fix For: 0.92.0
>
>         Attachments: org.apache.hadoop.hbase.master.TestMasterFailover-output.rar
>
>
> It goes like this:
> Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
> Master startups again and finds all living RS. 
> Master verifies the meta failed,  because this RS is crashing.
> Master reassigns the meta, but it doesn't split the Hlog. 
> So some meta data is loss.
> About the logs of a failover test case fail. 
> //It said that we want to kill a RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //Rs didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //Meta verification failed and ressigned the meta. So all the regions in the meta is loss.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It said that Master clean the cluster.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-4511) There is data loss when master failovers

Posted by "gaojinchao (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gaojinchao updated HBASE-4511:
------------------------------

    Priority: Minor  (was: Critical)
    
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Priority: Minor
>             Fix For: 0.92.0
>
>         Attachments: org.apache.hadoop.hbase.master.TestMasterFailover-output.rar
>
>
> It goes like this:
> Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
> Master startups again and finds all living RS. 
> Master verifies the meta failed,  because this RS is crashing.
> Master reassigns the meta, but it doesn't split the Hlog. 
> So some meta data is loss.
> About the logs of a failover test case fail. 
> //It said that we want to kill a RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //Rs didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //Meta verification failed and ressigned the meta. So all the regions in the meta is loss.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It said that Master clean the cluster.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-4511) There is data loss when master failovers

Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-4511:
-------------------------

    Attachment: 4511-v3.txt

Version I'll apply (Has Ram's input -- though I ended up just replacing isAlreadyExpired with a test for null and equals... thats all it was doing).
                
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Assignee: stack
>            Priority: Minor
>             Fix For: 0.92.0
>
>         Attachments: 4511-v2.txt, 4511-v3.txt, 4511.txt, org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt
>
>
> It goes like this:
> Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
> Master startups again and finds all living RS. 
> Master verifies the meta failed,  because this RS is crashing.
> Master reassigns the meta, but it doesn't split the Hlog. 
> So some meta data is loss.
> About the logs of a failover test case fail. 
> //It said that we want to kill a RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //Rs didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //Meta verification failed and ressigned the meta. So all the regions in the meta is loss.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It said that Master clean the cluster.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4511) There is data loss when master failovers

Posted by "ramkrishna.s.vasudevan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13145239#comment-13145239 ] 

ramkrishna.s.vasudevan commented on HBASE-4511:
-----------------------------------------------

bq. So we can add isAlreadyExpiring() and if true

Sorry it should be 
So we can add isAlreadyExpiring() and if flase we can go with expireIfOnline
                
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Assignee: stack
>            Priority: Minor
>             Fix For: 0.92.0
>
>         Attachments: 4511-v2.txt, 4511.txt, org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt
>
>
> It goes like this:
> Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
> Master startups again and finds all living RS. 
> Master verifies the meta failed,  because this RS is crashing.
> Master reassigns the meta, but it doesn't split the Hlog. 
> So some meta data is loss.
> About the logs of a failover test case fail. 
> //It said that we want to kill a RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //Rs didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //Meta verification failed and ressigned the meta. So all the regions in the meta is loss.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It said that Master clean the cluster.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4511) There is data loss when master failovers

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13148981#comment-13148981 ] 

Hudson commented on HBASE-4511:
-------------------------------

Integrated in HBase-TRUNK #2430 (See [https://builds.apache.org/job/HBase-TRUNK/2430/])
    HBASE-4511 There is data loss when master failovers

stack : 
Files : 
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/MetaReader.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java

                
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Assignee: stack
>            Priority: Minor
>             Fix For: 0.92.0
>
>         Attachments: 4511-v2.txt, 4511-v3.txt, 4511.txt, org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt
>
>
> It goes like this:
> Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
> Master startups again and finds all living RS. 
> Master verifies the meta failed,  because this RS is crashing.
> Master reassigns the meta, but it doesn't split the Hlog. 
> So some meta data is loss.
> About the logs of a failover test case fail. 
> //It said that we want to kill a RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //Rs didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //Meta verification failed and ressigned the meta. So all the regions in the meta is loss.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It said that Master clean the cluster.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4511) There is data loss when master failovers

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144585#comment-13144585 ] 

Hadoop QA commented on HBASE-4511:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12502576/4511.txt
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    -1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/185//console

This message is automatically generated.
                
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Priority: Minor
>             Fix For: 0.92.0
>
>         Attachments: 4511.txt, org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt
>
>
> It goes like this:
> Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
> Master startups again and finds all living RS. 
> Master verifies the meta failed,  because this RS is crashing.
> Master reassigns the meta, but it doesn't split the Hlog. 
> So some meta data is loss.
> About the logs of a failover test case fail. 
> //It said that we want to kill a RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //Rs didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //Meta verification failed and ressigned the meta. So all the regions in the meta is loss.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It said that Master clean the cluster.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4511) There is data loss when master failovers

Posted by "gaojinchao (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128554#comment-13128554 ] 

gaojinchao commented on HBASE-4511:
-----------------------------------

Ihis cannot be reproduced in real cluster and downgrade its priority.
                
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Priority: Critical
>             Fix For: 0.92.0
>
>         Attachments: org.apache.hadoop.hbase.master.TestMasterFailover-output.rar
>
>
> It goes like this:
> Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
> Master startups again and finds all living RS. 
> Master verifies the meta failed,  because this RS is crashing.
> Master reassigns the meta, but it doesn't split the Hlog. 
> So some meta data is loss.
> About the logs of a failover test case fail. 
> //It said that we want to kill a RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //Rs didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //Meta verification failed and ressigned the meta. So all the regions in the meta is loss.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It said that Master clean the cluster.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4511) There is data loss when master failovers

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144839#comment-13144839 ] 

Hadoop QA commented on HBASE-4511:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12502613/4511-v2.txt
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    -1 javadoc.  The javadoc tool appears to have generated -163 warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 48 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

     -1 core tests.  The patch failed these unit tests:
                       org.apache.hadoop.hbase.master.TestDistributedLogSplitting
                  org.apache.hadoop.hbase.coprocessor.TestRegionServerCoprocessorExceptionWithRemove
                  org.apache.hadoop.hbase.TestDrainingServer

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/189//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/189//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/189//console

This message is automatically generated.
                
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Assignee: stack
>            Priority: Minor
>             Fix For: 0.92.0
>
>         Attachments: 4511-v2.txt, 4511.txt, org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt
>
>
> It goes like this:
> Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
> Master startups again and finds all living RS. 
> Master verifies the meta failed,  because this RS is crashing.
> Master reassigns the meta, but it doesn't split the Hlog. 
> So some meta data is loss.
> About the logs of a failover test case fail. 
> //It said that we want to kill a RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //Rs didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //Meta verification failed and ressigned the meta. So all the regions in the meta is loss.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It said that Master clean the cluster.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4511) There is data loss when master failovers

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144553#comment-13144553 ] 

stack commented on HBASE-4511:
------------------------------

Let me address both of your criticisms.  Thanks for the review lads.
                
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Priority: Minor
>             Fix For: 0.92.0
>
>         Attachments: org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt
>
>
> It goes like this:
> Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
> Master startups again and finds all living RS. 
> Master verifies the meta failed,  because this RS is crashing.
> Master reassigns the meta, but it doesn't split the Hlog. 
> So some meta data is loss.
> About the logs of a failover test case fail. 
> //It said that we want to kill a RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //Rs didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //Meta verification failed and ressigned the meta. So all the regions in the meta is loss.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It said that Master clean the cluster.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4511) There is data loss when master failovers

Posted by "ramkrishna.s.vasudevan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13117978#comment-13117978 ] 

ramkrishna.s.vasudevan commented on HBASE-4511:
-----------------------------------------------

@Gao
I think generally the split of log will happen only if the RS dies and the RS node gets deleted.
If that does not happen then the split logs of that crashing RS may not happen.
May be in actual scenario though the master does not process as dead server by splitting the logs it will any way wait for ServerShutDownHandler to process it.

what do you feel Gao?
                
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Priority: Critical
>             Fix For: 0.92.0
>
>         Attachments: org.apache.hadoop.hbase.master.TestMasterFailover-output.rar
>
>
> It goes like this:
> Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
> Master startups again and finds all living RS. 
> Master verifies the meta failed,  because this RS is crashing.
> Master reassigns the meta, but it doesn't split the Hlog. 
> So some meta data is loss.
> About the logs of a failover test case fail. 
> //It said that we want to kill a RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //Rs didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //Meta verification failed and ressigned the meta. So all the regions in the meta is loss.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It said that Master clean the cluster.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-4511) There is data loss when master failovers

Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-4511:
-------------------------

    Attachment: 4511.txt

Address Ted comment.
                
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Priority: Minor
>             Fix For: 0.92.0
>
>         Attachments: 4511.txt, org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt
>
>
> It goes like this:
> Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
> Master startups again and finds all living RS. 
> Master verifies the meta failed,  because this RS is crashing.
> Master reassigns the meta, but it doesn't split the Hlog. 
> So some meta data is loss.
> About the logs of a failover test case fail. 
> //It said that we want to kill a RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //Rs didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //Meta verification failed and ressigned the meta. So all the regions in the meta is loss.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It said that Master clean the cluster.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-4511) There is data loss when master failovers

Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-4511:
-------------------------

    Status: Patch Available  (was: Open)
    
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Assignee: stack
>            Priority: Minor
>             Fix For: 0.92.0
>
>         Attachments: 4511-v2.txt, 4511.txt, org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt
>
>
> It goes like this:
> Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
> Master startups again and finds all living RS. 
> Master verifies the meta failed,  because this RS is crashing.
> Master reassigns the meta, but it doesn't split the Hlog. 
> So some meta data is loss.
> About the logs of a failover test case fail. 
> //It said that we want to kill a RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //Rs didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //Meta verification failed and ressigned the meta. So all the regions in the meta is loss.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It said that Master clean the cluster.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4511) There is data loss when master failovers

Posted by "gaojinchao (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144598#comment-13144598 ] 

gaojinchao commented on HBASE-4511:
-----------------------------------

@stack
Sorry , I am wrong. the patch makes sense.
                
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Priority: Minor
>             Fix For: 0.92.0
>
>         Attachments: 4511.txt, org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt
>
>
> It goes like this:
> Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
> Master startups again and finds all living RS. 
> Master verifies the meta failed,  because this RS is crashing.
> Master reassigns the meta, but it doesn't split the Hlog. 
> So some meta data is loss.
> About the logs of a failover test case fail. 
> //It said that we want to kill a RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //Rs didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //Meta verification failed and ressigned the meta. So all the regions in the meta is loss.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It said that Master clean the cluster.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-4511) There is data loss when master failovers

Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-4511:
-------------------------

    Status: Patch Available  (was: Open)
    
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Priority: Minor
>             Fix For: 0.92.0
>
>         Attachments: 4511.txt, org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt
>
>
> It goes like this:
> Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
> Master startups again and finds all living RS. 
> Master verifies the meta failed,  because this RS is crashing.
> Master reassigns the meta, but it doesn't split the Hlog. 
> So some meta data is loss.
> About the logs of a failover test case fail. 
> //It said that we want to kill a RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //Rs didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //Meta verification failed and ressigned the meta. So all the regions in the meta is loss.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It said that Master clean the cluster.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-4511) There is data loss when master failovers

Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-4511:
-------------------------

    Attachment: sketch.txt

I think there is a hole here as identified by Gaojinchao.  Its probably of rare incidence but its easy to reason about.

Say we fail verifying root region location when a new master joins an exisiting cluster, we will reassign it.  We may have failed though because the server has not yet expired in zk.  In this later case, the root will likely be assigned to a new server.  When the server that had been carrying the root eventually goes down, the shutdown handling code may treat it as a server that was carrying root (if root location has not yet been updated) or it may not (if root location has been updated).  In either case, its edits will likely be skipped when root opens in its new location.

Ditto for meta region.

Here is a sketch of a patch that does the following:

+ If on master start, verification of the root location fails AND the server we were verifying is a member of the online set, expire the server -- do not reassign root (let shutdown handler do the reassigning).
+ Ditto for meta (only don't re-expire a server if we just did it on root check).

Drastic but I think this 'correct'.

Patch is not working right though... need to check it out.... later.
                
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Priority: Minor
>             Fix For: 0.92.0
>
>         Attachments: org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt
>
>
> It goes like this:
> Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
> Master startups again and finds all living RS. 
> Master verifies the meta failed,  because this RS is crashing.
> Master reassigns the meta, but it doesn't split the Hlog. 
> So some meta data is loss.
> About the logs of a failover test case fail. 
> //It said that we want to kill a RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //Rs didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //Meta verification failed and ressigned the meta. So all the regions in the meta is loss.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It said that Master clean the cluster.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4511) There is data loss when master failovers

Posted by "ramkrishna.s.vasudevan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13145227#comment-13145227 ] 

ramkrishna.s.vasudevan commented on HBASE-4511:
-----------------------------------------------

@Stack
Patch looks fine. I have one suggestion
{code}
+      if (isExpiring(expiredServer, currentMetaServer) ||
+          expireIfOnline(currentMetaServer)) {
+        // We are expiring the server that is carrying meta because unreachable
+        // The expiration processing will take care of reassigning meta.
+      }
{code}

As you had clearly told if we are already expiring a server while assigning meta then we will not be expiring once again. 
So can we rename isExpiring to isAlreadyExpiring()?  Also can we split the conition because currently the if block is empty.
So we can add isAlreadyExpiring() and if true we can go with expireIfOnline.  Just a thought.  You can decide Stack. :)
                
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Assignee: stack
>            Priority: Minor
>             Fix For: 0.92.0
>
>         Attachments: 4511-v2.txt, 4511.txt, org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt
>
>
> It goes like this:
> Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
> Master startups again and finds all living RS. 
> Master verifies the meta failed,  because this RS is crashing.
> Master reassigns the meta, but it doesn't split the Hlog. 
> So some meta data is loss.
> About the logs of a failover test case fail. 
> //It said that we want to kill a RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //Rs didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //Meta verification failed and ressigned the meta. So all the regions in the meta is loss.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It said that Master clean the cluster.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4511) There is data loss when master failovers

Posted by "gaojinchao (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144604#comment-13144604 ] 

gaojinchao commented on HBASE-4511:
-----------------------------------

I am still a little doubt, If Meta RS is dying RS, How to pass this his name to metaLocation?

look this logs, it said that this.metaLocation is null .

2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state

                
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Priority: Minor
>             Fix For: 0.92.0
>
>         Attachments: 4511.txt, org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt
>
>
> It goes like this:
> Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
> Master startups again and finds all living RS. 
> Master verifies the meta failed,  because this RS is crashing.
> Master reassigns the meta, but it doesn't split the Hlog. 
> So some meta data is loss.
> About the logs of a failover test case fail. 
> //It said that we want to kill a RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //Rs didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //Meta verification failed and ressigned the meta. So all the regions in the meta is loss.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It said that Master clean the cluster.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-4511) There is data loss when master failovers

Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-4511:
-------------------------

      Resolution: Fixed
    Hadoop Flags: Reviewed
          Status: Resolved  (was: Patch Available)

Applied branch and trunk.  Thanks for the review lads.
                
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Assignee: stack
>            Priority: Minor
>             Fix For: 0.92.0
>
>         Attachments: 4511-v2.txt, 4511-v3.txt, 4511.txt, org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt
>
>
> It goes like this:
> Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
> Master startups again and finds all living RS. 
> Master verifies the meta failed,  because this RS is crashing.
> Master reassigns the meta, but it doesn't split the Hlog. 
> So some meta data is loss.
> About the logs of a failover test case fail. 
> //It said that we want to kill a RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //Rs didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //Meta verification failed and ressigned the meta. So all the regions in the meta is loss.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It said that Master clean the cluster.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-4511) There is data loss when master failovers

Posted by "gaojinchao (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gaojinchao updated HBASE-4511:
------------------------------

    Attachment: org.apache.hadoop.hbase.master.TestMasterFailover-output.rar

Some logs are provided by Ted
                
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Priority: Critical
>             Fix For: 0.92.0
>
>         Attachments: org.apache.hadoop.hbase.master.TestMasterFailover-output.rar
>
>
> It goes like this:
> Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
> Master startups again and finds all living RS. 
> Master verifies the meta failed,  because this RS is crashing.
> Master reassigns the meta, but it doesn't split the Hlog. 
> So some meta data is loss.
> About the logs of a failover test case fail. 
> //It said that we want to kill a RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //Rs didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //Meta verification failed and ressigned the meta. So all the regions in the meta is loss.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It said that Master clean the cluster.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-4511) There is data loss when master failovers

Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-4511:
-------------------------

    Assignee: stack
      Status: Open  (was: Patch Available)
    
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Assignee: stack
>            Priority: Minor
>             Fix For: 0.92.0
>
>         Attachments: 4511.txt, org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt
>
>
> It goes like this:
> Master crashed ,  at the same time RS with meta is crashing, but RS doesn't eixt.
> Master startups again and finds all living RS. 
> Master verifies the meta failed,  because this RS is crashing.
> Master reassigns the meta, but it doesn't split the Hlog. 
> So some meta data is loss.
> About the logs of a failover test case fail. 
> //It said that we want to kill a RS
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> //Rs didn't crash. 
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //Meta verification failed and ressigned the meta. So all the regions in the meta is loss.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It said that Master clean the cluster.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira