Posted to issues@hbase.apache.org by "Jieshan Bean (JIRA)" <ji...@apache.org> on 2011/05/24 03:51:53 UTC

[jira] [Created] (HBASE-3914) ROOT region appeared in two regionserver's onlineRegions at the same time

ROOT region appeared in two regionserver's onlineRegions at the same time
-------------------------------------------------------------------------

                 Key: HBASE-3914
                 URL: https://issues.apache.org/jira/browse/HBASE-3914
             Project: HBase
          Issue Type: Bug
          Components: master
    Affects Versions: 0.90.3
            Reporter: Jieshan Bean
             Fix For: 0.90.4


This can happen, with low probability, through the following steps:
(Assume the cluster nodes are named RS1, RS2 and HM, and that the cluster holds more than 10,000 regions.)

1. The ROOT region was opened on RS1.
2. For some reason (perhaps the HDFS process became abnormal), RS1 aborted.
3. ServerShutdownHandler processing started for RS1.
4. HMaster was restarted. During finishInitialization, the ROOT region's location was unset and the region was assigned to RS2.
5. The ROOT region was opened successfully on RS2.
6. After a while, RS1's ServerShutdownHandler unset the ROOT region's location again and reassigned the region. RS1 had been restarted before that, so there are two possibilities:
 Case a:
   The ROOT region was assigned to RS1.
   Nothing appeared to be affected, but the ROOT region was still online on RS2.

 Case b:
   The ROOT region was assigned to RS2.
   The ROOT region could not be opened until it was reassigned to another regionserver, because it was already shown as online on RS2.
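
The sequence above can be replayed in a toy model. This is only a sketch, not HBase code: the class DoubleAssignmentDemo, its methods and the in-memory onlineRegions map are hypothetical stand-ins for each regionserver's list of online regions, assumed here purely for illustration.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Toy model (hypothetical, not HBase code) of per-server online-region sets. */
final class DoubleAssignmentDemo {
    static final String ROOT = "-ROOT-,,0";

    // server name -> regions that server currently reports as online
    private final Map<String, Set<String>> onlineRegions = new LinkedHashMap<>();

    /** A crash followed by a restart: the server comes back with nothing online. */
    void restartServer(String server) {
        onlineRegions.put(server, new HashSet<>());
    }

    /**
     * Ask a server to open a region. Returns false when the server refuses
     * because the region is already online there (case b).
     */
    boolean open(String server, String region) {
        return onlineRegions.computeIfAbsent(server, k -> new HashSet<>()).add(region);
    }

    /** Number of servers that currently have the region online. */
    long serversHosting(String region) {
        return onlineRegions.values().stream().filter(s -> s.contains(region)).count();
    }
}
```

Replaying the steps: open ROOT on RS1, restart RS1, then open ROOT on RS2. A further open on RS1 succeeds and serversHosting reports two servers (case a), whereas a further open on RS2 is refused (case b).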

The logs confirm this:

1. The ROOT region was opened twice:
2011-05-17 10:32:59,188 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region -ROOT-,,0.70236052 on 162-2-77-0,20020,1305598359031
2011-05-17 10:33:01,536 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region -ROOT-,,0.70236052 on 162-2-16-6,20020,1305597548212

2. Regionserver 162-2-16-6 aborted, so the ROOT region was reassigned to 162-2-77-0, where it was already online:
10:49:30,920 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: -ROOT-,,0.70236052
10:49:30,920 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of -ROOT-,,0.70236052
10:49:30,920 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Attempted open of -ROOT-,,0.70236052 but already online on this server

This can leave the ROOT region offline for a long time, although it only happens in a special scenario. I have checked the code, and there seems to be a small bug here.

There are two call sites of assignRoot():

1.
HMaster#assignRootAndMeta:

    if (!catalogTracker.verifyRootRegionLocation(timeout)) {
      this.assignmentManager.assignRoot();
      this.catalogTracker.waitForRoot();
      assigned++;
    }

2.
ServerShutdownHandler#process:

      if (isCarryingRoot()) { // -ROOT-
        try {
          this.services.getAssignmentManager().assignRoot();
        } catch (KeeperException e) {
          this.server.abort("In server shutdown processing, assigning root", e);
          throw new IOException("Aborting", e);
        }
      }

I think we should verify the ROOT region's location every time before calling assignRoot(), because by then the ROOT region may already have been assigned from somewhere else.
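
A minimal sketch of the verify-before-assign idea follows. It is a single-threaded toy model rather than the actual patch: the class RootAssignmentSketch and its fields are hypothetical, and its verifyRootRegionLocation() and assignRoot() only mimic the roles of the real CatalogTracker and AssignmentManager methods.

```java
/** Toy, single-threaded model (hypothetical, not HBase code) of guarded ROOT assignment. */
final class RootAssignmentSketch {
    // Server currently believed to host -ROOT-, or null if none is known.
    private String rootLocation;
    private int assignments;

    /** Mimics the role of verifyRootRegionLocation: is -ROOT- online somewhere? */
    boolean verifyRootRegionLocation() {
        return rootLocation != null;
    }

    /** Mimics the role of assignRoot: unconditionally (re)assigns -ROOT-. */
    void assignRoot(String server) {
        rootLocation = server;
        assignments++;
    }

    /** Guarded variant: assign only when verification fails. Returns true if it assigned. */
    boolean verifyAndAssignRoot(String server) {
        if (verifyRootRegionLocation()) {
            return false; // already assigned elsewhere, so skip the second assignment
        }
        assignRoot(server);
        return true;
    }

    String location() {
        return rootLocation;
    }

    int assignmentCount() {
        return assignments;
    }
}
```

With this guard, the master's startup path can assign ROOT to RS2 first, and a later call made on behalf of RS1's shutdown handling becomes a no-op instead of a second assignment.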





--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-3914) ROOT region appeared in two regionserver's onlineRegions at the same time

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038411#comment-13038411 ] 

Jieshan Bean commented on HBASE-3914:
-------------------------------------

Thanks for your quick comments on this patch, Stack.
1. "hbase.catalog.verification.timeout" is not new. I took it from the original code (in HMaster#assignRootAndMeta); other code uses it the same way, so I just followed that.

2. I had thought about adding this method to ServerShutdownHandler, and indeed it would be better as a private method.

I will make the change immediately :)
   

> ROOT region appeared in two regionserver's onlineRegions at the same time
> -------------------------------------------------------------------------
>
>                 Key: HBASE-3914
>                 URL: https://issues.apache.org/jira/browse/HBASE-3914
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3914.patch
>
>

[jira] [Commented] (HBASE-3914) ROOT region appeared in two regionserver's onlineRegions at the same time

Posted by "mingjian (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146111#comment-13146111 ] 

mingjian commented on HBASE-3914:
---------------------------------

@stack: The following is our master log:
{noformat} 
2011-10-19 19:13:34,873 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_META_SERVER_SHUTDOWN
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.ipc.ServerNotRunningException: Server is not running yet
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1090)

        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:256)
        at $Proxy7.getRegionInfo(Unknown Source)
        at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyRegionLocation(CatalogTracker.java:424)
        at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyRootRegionLocation(CatalogTracker.java:471)
        at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.verifyAndAssignRoot(ServerShutdownHandler.java:90)
        at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:126)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:151)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
{noformat}

After this, the -ROOT- region is never assigned, as shown here:
{noformat} 
2011-10-19 19:18:40,000 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: locateRegionInMeta parentTable=-ROOT-, metaLocation=address: dw79.kgb.sqa.cm4:60020, regioninfo: -ROOT-,,0.70236052, attempt=0 of 10 failed; retrying after sleep of 1000 because: org.apache.hadoop.hbase.NotServingRegionException: Region is not online: -ROOT-,,0
        at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2771)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.getClosestRowBefore(HRegionServer.java:1802)
        at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:569)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1091)
{noformat}
So we should rewrite the verify method in both branch-0.90 and trunk.
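
One possible direction, shown only as a sketch: treat an exception from the verification call (such as the "Server is not running yet" error in the first trace) as transient and retry, instead of letting it escape the handler. The helper verifyWithRetry, its parameters and its retry policy are assumptions made for illustration; they are not taken from any HBase patch.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper (not HBase code): retry a verification call that may fail transiently. */
final class VerifyWithRetry {
    /**
     * Calls verify up to maxAttempts times, sleeping sleepMillis between attempts.
     * Returns the verification result, or false if every attempt threw, so the
     * caller can treat the location as unverified and go on to assign the region.
     */
    static boolean verifyWithRetry(Callable<Boolean> verify, int maxAttempts, long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return verify.call();
            } catch (Exception e) {
                // Assumed transient, e.g. the target server has not finished starting.
                if (attempt == maxAttempts) {
                    break;
                }
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                    return false;
                }
            }
        }
        return false;
    }
}
```

Returning false after the last failed attempt means the caller falls through to assignment rather than aborting the shutdown handling, which is the failure mode seen in the log above.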
                
> ROOT region appeared in two regionserver's onlineRegions at the same time
> -------------------------------------------------------------------------
>
>                 Key: HBASE-3914
>                 URL: https://issues.apache.org/jira/browse/HBASE-3914
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3914-V2.patch, HBASE-3914.patch
>
>

[jira] [Commented] (HBASE-3914) ROOT region appeared in two regionserver's onlineRegions at the same time

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13039403#comment-13039403 ] 

Hudson commented on HBASE-3914:
-------------------------------

Integrated in HBase-TRUNK #1939 (See [https://builds.apache.org/hudson/job/HBase-TRUNK/1939/])
    

> ROOT region appeared in two regionserver's onlineRegions at the same time
> -------------------------------------------------------------------------
>
>                 Key: HBASE-3914
>                 URL: https://issues.apache.org/jira/browse/HBASE-3914
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3914-V2.patch, HBASE-3914.patch
>
>

[jira] [Commented] (HBASE-3914) ROOT region appeared in two regionserver's onlineRegions at the same time

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038956#comment-13038956 ] 

stack commented on HBASE-3914:
------------------------------

bq. In future, I will notice what you remind me, and I also hope I can contribute more patches.

No problem and keep the patches coming!

> ROOT region appeared in two regionserver's onlineRegions at the same time
> -------------------------------------------------------------------------
>
>                 Key: HBASE-3914
>                 URL: https://issues.apache.org/jira/browse/HBASE-3914
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3914-V2.patch, HBASE-3914.patch
>
>

[jira] [Updated] (HBASE-3914) ROOT region appeared in two regionserver's onlineRegions at the same time

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-3914:
--------------------------------

    Status: Patch Available  (was: Open)

In ServerShutdownHandler#process, if the server was carrying the ROOT region, verify the ROOT region's location before assigning it.


> ROOT region appeared in two regionserver's onlineRegions at the same time
> -------------------------------------------------------------------------
>
>                 Key: HBASE-3914
>                 URL: https://issues.apache.org/jira/browse/HBASE-3914
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3914.patch
>
>

[jira] [Updated] (HBASE-3914) ROOT region appeared in two regionserver's onlineRegions at the same time

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-3914:
--------------------------------

    Attachment: HBASE-3914-V2.patch

If possible, I hope this patch can be reviewed quickly. Thanks :)

> ROOT region appeared in two regionserver's onlineRegions at the same time
> -------------------------------------------------------------------------
>
>                 Key: HBASE-3914
>                 URL: https://issues.apache.org/jira/browse/HBASE-3914
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3914-V2.patch, HBASE-3914.patch
>
>

[jira] [Updated] (HBASE-3914) ROOT region appeared in two regionserver's onlineRegions at the same time

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-3914:
--------------------------------

    Attachment: HBASE-3914.patch

> ROOT region appeared in two regionserver's onlineRegions at the same time
> -------------------------------------------------------------------------
>
>                 Key: HBASE-3914
>                 URL: https://issues.apache.org/jira/browse/HBASE-3914
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3914.patch
>
>
> This can happen, with low probability, through the following steps:
> (Assume the cluster nodes are named RS1/RS2/HM, and the cluster has more than 10,000 regions.)
> 1. The ROOT region was opened on RS1.
> 2. For some reason (perhaps the HDFS process became abnormal), RS1 aborted.
> 3. ServerShutdownHandler processing started.
> 4. HMaster was restarted. During finishInitialization, the ROOT region was unset and assigned to RS2.
> 5. The ROOT region was opened successfully on RS2.
> 6. After a while, the ROOT region was unset again by RS1's ServerShutdownHandler and then reassigned. Before that, RS1 had been restarted. So there are two possibilities:
>  Case a:
>    The ROOT region was assigned to RS1.
>    Nothing seemed to be affected, but the ROOT region was still online on RS2.
>
>  Case b:
>    The ROOT region was assigned to RS2.
>    The ROOT region could not be opened until it was reassigned to another regionserver, because it was already shown as online on this regionserver.
> This can be shown from the logs:
> 1. The ROOT region was opened twice:
> 2011-05-17 10:32:59,188 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region -ROOT-,,0.70236052 on 162-2-77-0,20020,1305598359031
> 2011-05-17 10:33:01,536 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region -ROOT-,,0.70236052 on 162-2-16-6,20020,1305597548212
> 2. Regionserver 162-2-16-6 aborted, so the ROOT region was reassigned to 162-2-77-0, but it was already online on that server:
> 10:49:30,920 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: -ROOT-,,0.70236052
> 10:49:30,920 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of -ROOT-,,0.70236052
> 10:49:30,920 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Attempted open of -ROOT-,,0.70236052 but already online on this server
> This could leave the ROOT region offline for a long time, although it only happens in a special scenario. I have checked the code, and there seems to be a small bug here.
> There are two call sites of assignRoot():
> 1. HMaster#assignRootAndMeta:
>     if (!catalogTracker.verifyRootRegionLocation(timeout)) {
>       this.assignmentManager.assignRoot();
>       this.catalogTracker.waitForRoot();
>       assigned++;
>     }
> 2. ServerShutdownHandler#process:
>       if (isCarryingRoot()) { // -ROOT-
>         try {
>           this.services.getAssignmentManager().assignRoot();
>         } catch (KeeperException e) {
>           this.server.abort("In server shutdown processing, assigning root", e);
>           throw new IOException("Aborting", e);
>         }
>       }
> I think that each time assignRoot() is called, we should verify the ROOT region's location first, because the ROOT region could already have been assigned elsewhere before this assignment.
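
As a rough illustration of the suggestion above (this is not the attached patch), the following self-contained Java sketch replays steps 1-6 against a toy model. RootAssignSketch, onlineOn, rootLocation, and replay are hypothetical names made up for this example; the real CatalogTracker and AssignmentManager do far more than this.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the race described above; every name here is a hypothetical stand-in. */
class RootAssignSketch {
  /** Region servers that currently list -ROOT- among their online regions. */
  final Set<String> onlineOn = new HashSet<>();
  /** Where the master believes -ROOT- lives, or null when unset. */
  String rootLocation;

  /** Stand-in for verifyRootRegionLocation: true if the recorded host really serves -ROOT-. */
  boolean verifyRootLocation() {
    return rootLocation != null && onlineOn.contains(rootLocation);
  }

  /** Stand-in for assignRoot(): unset the location, then open -ROOT- on the target. */
  void assignRoot(String target) {
    rootLocation = null;
    onlineOn.add(target); // note: nothing closes -ROOT- on any other server
    rootLocation = target;
  }

  /** Guarded variant: assign only when the current location fails verification. */
  boolean verifyAndAssignRoot(String target) {
    if (verifyRootLocation()) {
      return false; // -ROOT- is already served somewhere valid; skip the assignment
    }
    assignRoot(target);
    return true;
  }

  /** Replays steps 1-6 (case a) and returns how many servers end up hosting -ROOT-. */
  static int replay(boolean guarded) {
    RootAssignSketch c = new RootAssignSketch();
    c.assignRoot("RS1");        // step 1: ROOT opened on RS1
    c.onlineOn.remove("RS1");   // step 2: RS1 aborts
    c.assignRoot("RS2");        // steps 4-5: restarted master assigns ROOT to RS2
    // step 6: RS1's shutdown handling runs late, after RS1 has restarted
    if (guarded) {
      c.verifyAndAssignRoot("RS1");
    } else {
      c.assignRoot("RS1");
    }
    return c.onlineOn.size();
  }

  public static void main(String[] args) {
    System.out.println("unguarded hosts: " + replay(false));
    System.out.println("guarded hosts:   " + replay(true));
  }
}
```

In this model the unguarded path leaves -ROOT- online on two servers, while the guarded path leaves it on one, because the late handler sees that the existing location still verifies and skips the assignment.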


[jira] [Commented] (HBASE-3914) ROOT region appeared in two regionserver's onlineRegions at the same time

Posted by "mingjian (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146748#comment-13146748 ] 

mingjian commented on HBASE-3914:
---------------------------------

@stack:
I created HBASE-4762.
                
> ROOT region appeared in two regionserver's onlineRegions at the same time
> -------------------------------------------------------------------------
>
>                 Key: HBASE-3914
>                 URL: https://issues.apache.org/jira/browse/HBASE-3914
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3914-V2.patch, HBASE-3914.patch
>
>

[jira] [Commented] (HBASE-3914) ROOT region appeared in two regionserver's onlineRegions at the same time

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038887#comment-13038887 ] 

Jieshan Bean commented on HBASE-3914:
-------------------------------------

Thanks stack.
In future, I will keep in mind what you reminded me of, and I hope I can contribute more patches.
:)


[jira] [Commented] (HBASE-3914) ROOT region appeared in two regionserver's onlineRegions at the same time

Posted by "mingjian (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146062#comment-13146062 ] 

mingjian commented on HBASE-3914:
---------------------------------

It seems the root region will never be assigned if verifyRootRegionLocation throws an IOException.
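
To make the concern concrete, here is a hedged Java sketch (not the committed code) of a verify-then-assign step whose verification can throw. Verifier, VerifyRetrySketch, maxRetries, and assignCalls are assumptions introduced only for this illustration. With a bounded retry loop that rethrows once the retries run out, a verifier that always throws IOException means the assignment is never reached.

```java
import java.io.IOException;

/** Hypothetical sketch: what happens to assignment when verification keeps throwing. */
class VerifyRetrySketch {
  /** Stand-in for a root location check that may fail with an IOException. */
  interface Verifier {
    boolean verify() throws IOException;
  }

  /** Counts how often the stand-in assignRoot() ran. */
  int assignCalls = 0;

  void assignRoot() {
    assignCalls++;
  }

  /** Verify first; assign only if verification reports the location as invalid. */
  void verifyAndAssignRoot(Verifier v) throws IOException {
    if (!v.verify()) {
      assignRoot();
    }
  }

  /** Retries on IOException up to maxRetries times, then gives up and rethrows. */
  void verifyAndAssignRootWithRetries(Verifier v, int maxRetries) throws IOException {
    int attempt = 0;
    while (true) {
      try {
        verifyAndAssignRoot(v);
        return;
      } catch (IOException e) {
        if (attempt >= maxRetries) {
          throw new IOException("giving up after " + attempt + " retries", e);
        }
        attempt++;
      }
    }
  }

  /** Runs against a verifier that always throws and returns the number of assign calls. */
  static int runWithAlwaysFailingVerifier(int maxRetries) {
    VerifyRetrySketch s = new VerifyRetrySketch();
    try {
      s.verifyAndAssignRootWithRetries(() -> {
        throw new IOException("unreachable server");
      }, maxRetries);
    } catch (IOException expected) {
      // verification never returned, so assignRoot() was never attempted
    }
    return s.assignCalls;
  }

  public static void main(String[] args) {
    System.out.println("assign calls: " + runWithAlwaysFailingVerifier(3));
  }
}
```

One possible design choice, again only as an assumption, would be to treat an exception from verification the same as a failed verification and fall through to the assignment; whether that is safe depends on why the verification threw.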
                

[jira] [Commented] (HBASE-3914) ROOT region appeared in two regionserver's onlineRegions at the same time

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146076#comment-13146076 ] 

stack commented on HBASE-3914:
------------------------------

@mingjian Do you have a log?  Do you want to open a new issue?
                

[jira] [Commented] (HBASE-3914) ROOT region appeared in two regionserver's onlineRegions at the same time

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038395#comment-13038395 ] 

stack commented on HBASE-3914:
------------------------------

Thanks for looking into this Jieshan.

Why not keep the change local to ServerShutdownHandler?  If it is the only class that uses the new verifyAndAssignRoot, why not make it private to the ServerShutdownHandler class?

hbase.catalog.verification.timeout is new?  And the default is only 1 second?  Is that enough time?

Good stuff


[jira] [Commented] (HBASE-3914) ROOT region appeared in two regionserver's onlineRegions at the same time

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146330#comment-13146330 ] 

stack commented on HBASE-3914:
------------------------------

Can we do this discussion in a new issue Mingjie?  Mind opening one?

Is the above log from a 0.90 or a 0.92/trunk?  Thanks.
                

[jira] [Updated] (HBASE-3914) ROOT region appeared in two regionserver's onlineRegions at the same time

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-3914:
-------------------------

      Resolution: Fixed
        Assignee: Jieshan Bean
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

Committed to trunk and branch.  Thanks for the patch Jieshan (I added you as a contributor and assigned you this issue).  FYI in future, please follow the coding convention of the code that surrounds your additions; e.g. two spaces for tabs, lines < 80 characters, and notice how curly braces are put at the end of the line rather than on a new line when adding an exception clause.

Good stuff.
