Posted to issues@hbase.apache.org by "ramkrishna.s.vasudevan (Commented) (JIRA)" <ji...@apache.org> on 2011/12/23 09:44:30 UTC

[jira] [Commented] (HBASE-5092) Two adjacent assignments lead region is in PENDING_OPEN state and block table disable and enable actions.

    [ https://issues.apache.org/jira/browse/HBASE-5092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175330#comment-13175330 ] 

ramkrishna.s.vasudevan commented on HBASE-5092:
-----------------------------------------------

@Liu
Thanks for your analysis Liu and for the patch.
{code}
if (!transitionZookeeperOfflineToOpening(encodedName,
          versionOfOfflineNode)) {
        LOG.warn("Region was hijacked? It no longer exists, encodedName=" +
          encodedName);
        tryTransitionToFailedOpen(regionInfo);
        return;
      }
{code}
This fix may solve the problem in the case where the second assign request goes to the same RS and the RIT exception (RegionAlreadyInTransitionException) is thrown.  But it may not be correct when the second assign goes to a different RS.

With this fix, the open request on the first RS will try to change the node from OFFLINE to FAILED_OPEN.  The open request on the second RS expects the node to still be in OFFLINE state, so it will fail.  The assign then has to be retried once more and the region assigned to another RS.
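
To make the two-RS sequence concrete, here is a minimal, self-contained sketch. It does not use any HBase classes: `Node`, `forceOffline` and `transition` are hypothetical stand-ins for the versioned unassigned znode, the master forcing it OFFLINE, and a ZKAssign-style transition. Treating an expected version of -1 as "skip the version check" is an assumption of this model, and the FAILED_OPEN exemption mirrors the relaxed check proposed in the patch.

```java
final class TwoServerAssignSketch {
    enum State { OFFLINE, OPENING, FAILED_OPEN }

    /** Hypothetical stand-in for the versioned unassigned znode. */
    static final class Node {
        State state = State.OFFLINE;
        int version = 0;

        /** Master side: force the node to OFFLINE, bumping its version. */
        int forceOffline() {
            state = State.OFFLINE;
            return ++version;
        }

        /**
         * Region server side: versioned transition.
         * Returns the new version, or -1 if the transition is refused.
         * expectedVersion == -1 skips the version check (assumption of this sketch).
         */
        int transition(State begin, State end, int expectedVersion) {
            if (expectedVersion != -1 && expectedVersion != version) {
                return -1; // someone else changed the node
            }
            // Relaxed state check: a transition to FAILED_OPEN is always let through.
            if (state != begin && end != State.FAILED_OPEN) {
                return -1;
            }
            state = end;
            return ++version;
        }
    }

    /** Replays: assign to RS1, assign to RS2, then both open handlers run. */
    static String run() {
        Node node = new Node();
        int versionForRs1 = node.forceOffline(); // first assign, open sent to RS1
        int versionForRs2 = node.forceOffline(); // second assign, open sent to RS2

        // RS1 holds a stale version, so OFFLINE -> OPENING is refused ...
        int rs1 = node.transition(State.OFFLINE, State.OPENING, versionForRs1);
        if (rs1 == -1) {
            // ... and with the fix it marks the node FAILED_OPEN instead of returning.
            node.transition(State.OPENING, State.FAILED_OPEN, -1);
        }

        // RS2 expects OFFLINE at its version, but RS1 has already changed the node.
        int rs2 = node.transition(State.OFFLINE, State.OPENING, versionForRs2);
        return rs1 + "," + rs2 + "," + node.state;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In this model neither RS ends up opening the region and the node is left in FAILED_OPEN, so the region is only opened if the master retries the assign.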

As part of HBASE-4153 this was changed so that on a RIT exception we do not retry, specifically for the case when assign() is triggered externally. Let's wait for suggestions from others as well.





                
> Two adjacent assignments lead region is in PENDING_OPEN state and block table disable and enable actions.
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5092
>                 URL: https://issues.apache.org/jira/browse/HBASE-5092
>             Project: HBase
>          Issue Type: Bug
>          Components: master, regionserver
>    Affects Versions: 0.92.0
>            Reporter: Liu Jia
>            Assignee: Liu Jia
>         Attachments: unhandled_PENDING_OPEN_lead_by_two_assignment.patch
>
>
>   
> A region stays in PENDING_OPEN state, and table disable and enable are blocked.
> We occasionally find that two assignments issued within a short interval leave a PENDING_OPEN state in the regionInTransition map, which blocks the disable and enable table actions.
> We found that the second assignment sets the znode of this region to M_ZK_REGION_OFFLINE, then sets the state in the AssignmentManager's regionInTransition map to PENDING_OPEN, and then aborts its further operation because the regionserver reports, through a RegionAlreadyInTransitionException, that the region is already in transition there.
> At the same time, the first assignment is in tickleOpening and finds that the version of the znode has been changed by the second assignment, so the OpenRegionHandler prints the following two lines:
> {noformat} 
> 2011-12-23 22:12:15,197 WARN  [RS_OPEN_REGION-data16,59892,1324649528415-0] zookeeper.ZKAssign(788): regionserver:59892-0x1346b43b91e0002 Attempt to transition the unassigned node for 15237599c632752b8cfd3d5a86349768 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING failed, the node existed but was version 2 not the expected version 1
> 2011-12-23 22:12:15,197 WARN  [RS_OPEN_REGION-data16,59892,1324649528415-0] handler.OpenRegionHandler(403): Failed refreshing OPENING; region=15237599c632752b8cfd3d5a86349768, context=post_region_open
> {noformat} 
> After that it tries to move the state to FAILED_OPEN, but that also fails, because the second assignment has already changed the node (it is now in state M_ZK_REGION_OFFLINE).
> This is the output:
> {noformat} 
> 2011-12-23 22:12:15,199 WARN  [RS_OPEN_REGION-data16,59892,1324649528415-0] zookeeper.ZKAssign(812): regionserver:59892-0x1346b43b91e0002 Attempt to transition the unassigned node for 15237599c632752b8cfd3d5a86349768 from RS_ZK_REGION_OPENING to RS_ZK_REGION_FAILED_OPEN failed, the node existed but was in the state M_ZK_REGION_OFFLINE set by the server data16,59892,1324649528415
> 2011-12-23 22:12:15,199 WARN  [RS_OPEN_REGION-data16,59892,1324649528415-0] handler.OpenRegionHandler(307): Unable to mark region {NAME => 'table1,,1324649533045.15237599c632752b8cfd3d5a86349768.', STARTKEY => '', ENDKEY => '', ENCODED => 15237599c632752b8cfd3d5a86349768,} as FAILED_OPEN. It's likely that the master already timed out this open attempt, and thus another RS already has the region.
> {noformat} 
> So after all that, the PENDING_OPEN state is left in the AssignmentManager's regionInTransition map and nothing deals with it further.
> The region stays in this state until the master detects that it has timed out.
> The following is the test code:
> {code:title=test.java|borderStyle=solid}
> @Test
>   public void testDisableTables() throws IOException {
>     for (int i = 0; i < 20; i++) {
>       HTableDescriptor des = admin.getTableDescriptor(Bytes.toBytes(table1));
>       List<HRegionInfo> hris = TEST_UTIL.getHBaseCluster().getMaster()
>           .getAssignmentManager().getRegionsOfTable(Bytes.toBytes(table1));
>       TEST_UTIL.getHBaseCluster().getMaster()
>           .assign(hris.get(0).getRegionName());
>   
>       TEST_UTIL.getHBaseCluster().getMaster()
>           .assign(hris.get(0).getRegionName());
>   
>       admin.disableTable(Bytes.toBytes(table1));
>       admin.modifyTable(Bytes.toBytes(table1), des);
>       admin.enableTable(Bytes.toBytes(table1));
>     }
>   }
> {code}
> To fix this, we add a line to
> public static int ZKAssign.transitionNode() so that a transition whose endState is RS_ZK_REGION_FAILED_OPEN passes the state check.
> {code:title=ZKAssign.java|borderStyle=solid}
>    if ((!existingData.getEventType().equals(beginState))
>       // Added line: let a transition to RS_ZK_REGION_FAILED_OPEN pass even if the begin state does not match.
>       && (!endState.equals(EventType.RS_ZK_REGION_FAILED_OPEN))) {
>       LOG.warn(zkw.prefix("Attempt to transition the " +
>         "unassigned node for " + encoded +
>         " from " + beginState + " to " + endState + " failed, " +
>         "the node existed but was in the state " + existingData.getEventType() +
>         " set by the server " + serverName));
>       return -1;
>     }
> {code}
> Running the test case again, we found that before the first assignment moves the state from OFFLINE to OPENING, the second assignment can set the state to OFFLINE again and change the version of the znode.
> In OpenRegionHandler.process() the following part then fails and makes process() return:
> {code:title=OpenRegionHandler.java|borderStyle=solid}
>  if (!transitionZookeeperOfflineToOpening(encodedName,
>           versionOfOfflineNode)) {
>         LOG.warn("Region was hijacked? It no longer exists, encodedName=" +
>           encodedName);
>         return;
>       }
> {code}
> So we add the following call to that branch, so that this open-region attempt moves the node to FAILED_OPEN:
> {code:title=OpenRegionHandler.java|borderStyle=solid}
>  if (!transitionZookeeperOfflineToOpening(encodedName,
>           versionOfOfflineNode)) {
>         LOG.warn("Region was hijacked? It no longer exists, encodedName=" +
>           encodedName);
>         tryTransitionToFailedOpen(regionInfo);
>         return;
>       }
> {code}
> After the two amendments, two adjacent assignments will not lead to an unhandled PENDING_OPEN state.
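
The sequence in the quoted report, and the effect of relaxing the state check in transitionNode(), can be replayed with a small model. This is only a sketch under stated assumptions: it does not use HBase or ZooKeeper classes, `Node`, `forceOffline`, `transition` and `replay` are hypothetical names, and passing -1 as the expected version is taken to mean "do not check the version".

```java
final class PendingOpenSketch {
    enum State { OFFLINE, OPENING, FAILED_OPEN }

    /** Hypothetical stand-in for the versioned unassigned znode. */
    static final class Node {
        State state = State.OFFLINE;
        int version = 0;
        final boolean relaxedCheck; // true models the check with the added line

        Node(boolean relaxedCheck) {
            this.relaxedCheck = relaxedCheck;
        }

        /** Master side: force the node to OFFLINE, bumping its version. */
        int forceOffline() {
            state = State.OFFLINE;
            return ++version;
        }

        /** Returns the new version, or -1 if the transition is refused. */
        int transition(State begin, State end, int expectedVersion) {
            if (expectedVersion != -1 && expectedVersion != version) {
                return -1; // version mismatch
            }
            boolean exempt = relaxedCheck && end == State.FAILED_OPEN;
            if (state != begin && !exempt) {
                return -1; // node is not in the expected begin state
            }
            state = end;
            return ++version;
        }
    }

    /** Replays two adjacent assigns against one region server. */
    static State replay(boolean relaxedCheck) {
        Node node = new Node(relaxedCheck);
        int offlineVersion = node.forceOffline(); // first assign
        int openingVersion =
            node.transition(State.OFFLINE, State.OPENING, offlineVersion);
        node.forceOffline(); // second assign resets the node to OFFLINE

        // The first handler refreshes OPENING with a version that is now stale.
        int refreshed =
            node.transition(State.OPENING, State.OPENING, openingVersion);
        if (refreshed == -1) {
            // It then tries to give up by marking the node FAILED_OPEN.
            node.transition(State.OPENING, State.FAILED_OPEN, -1);
        }
        return node.state;
    }

    public static void main(String[] args) {
        System.out.println("strict check:  " + replay(false));
        System.out.println("relaxed check: " + replay(true));
    }
}
```

With the strict check the model leaves the node in OFFLINE, which corresponds to the master never being told that the open failed. With the relaxed check the node reaches FAILED_OPEN.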
