
Posted to issues@hbase.apache.org by "Ming Ma (JIRA)" <ji...@apache.org> on 2011/09/20 01:23:09 UTC

[jira] [Created] (HBASE-4446) Rolling restart RS, region could stay in OPENING state

Rolling restart RS, region could stay in OPENING state
------------------------------------------------------

                 Key: HBASE-4446
                 URL: https://issues.apache.org/jira/browse/HBASE-4446
             Project: HBase
          Issue Type: Bug
          Components: master
            Reporter: Ming Ma
            Assignee: Ming Ma


Keep the Master up the whole time and do a rolling restart of the RSs like this: stop RS1, wait 2 seconds, stop RS2, start RS1, wait 2 seconds, stop RS3, start RS2, wait 2 seconds, and so on. A region can sometimes stay in the OPENING state even after the TimeoutMonitor period has passed.
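
The restart order described above can be written down as a small schedule generator. This is only a sketch of the sequence in the report: the class and method names are hypothetical, it lists the steps rather than stopping or starting any process, and the final start of the last RS is an assumption, since the description trails off before the sequence ends.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the steps of the rolling restart described above.
// It does not touch a cluster; it only builds the order of operations.
final class RollingRestartSchedule {

  static List<String> steps(int regionServers, int waitSeconds) {
    List<String> steps = new ArrayList<>();
    for (int i = 1; i <= regionServers; i++) {
      steps.add("stop RS" + i);
      if (i > 1) {
        // The previous RS is started only after the next one has been stopped,
        // so two RSs are briefly down at the same time.
        steps.add("start RS" + (i - 1));
      }
      steps.add("wait " + waitSeconds + "s");
    }
    // Assumption: the last RS is started once the loop is done.
    steps.add("start RS" + regionServers);
    return steps;
  }

  public static void main(String[] args) {
    steps(3, 2).forEach(System.out::println);
  }
}
```

The overlap between stopping one RS and starting the previous one is what makes it likely that a region is mid-open on a server that is about to go away.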


2011-09-19 08:10:33,131 WARN org.apache.hadoop.hbase.master.AssignmentManager: While timing out a region in state OPENING, found ZK node in unexpected state: RS_ZK_REGION_FAILED_OPEN

The issue: the RS was shut down while a region was being opened, and the region was transitioned to RS_ZK_REGION_FAILED_OPEN in ZK. The TimeoutMonitor did not handle RS_ZK_REGION_FAILED_OPEN.

processOpeningState
...
      else if (dataInZNode.getEventType() != EventType.RS_ZK_REGION_OPENING && ...) {
        LOG.warn("While timing out a region in state OPENING, "
            + "found ZK node in unexpected state: "
            + dataInZNode.getEventType());
        return;
      }



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4446) Rolling restart RSs scenario, regions could stay in OPENING state

Posted by "Ming Ma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108385#comment-13108385 ] 

Ming Ma commented on HBASE-4446:
--------------------------------

Good point, Todd. Thanks, Ted. Here is why the master didn't handle this. (Note that part of the log below comes from the new code.) By the time AssignmentManager gets the notification, the RS is no longer online, so the processing based on the ZK callback is skipped.

2011-09-19 22:04:54,506 WARN org.apache.hadoop.hbase.master.AssignmentManager: Attempted to handle region transition for server but server is not online: miweng_test,1??s$? >,1316493502701.6409ae717931daee3705f3e7d33d85b5.


2011-09-19 22:22:06,561 WARN org.apache.hadoop.hbase.master.AssignmentManager: While timing out a region in state OPENING, found ZK node in unexpected state: RS_ZK_REGION_FAILED_OPEN region= miweng_test,1\xC8\xFAs$\xB7 >,1316493502701.6409ae717931daee3705f3e7d33d85b5.


That also means we could fix the issue in a different way. Why does AssignmentManager.handleRegion have to enforce the following condition and rely on TimeoutMonitor and ServerShutdownHandler to kick in? At least for certain states, such as RS_ZK_REGION_FAILED_OPEN and RS_ZK_REGION_CLOSED, AssignmentManager.handleRegion could still process the event even though the RS is down.

      // Verify this is a known server
      if (!serverManager.isServerOnline(sn) &&
          !this.master.getServerName().equals(sn)) {
        LOG.warn("Attempted to handle region transition for server but " +
          "server is not online: " + Bytes.toString(data.getRegionName()));
        return;
      }
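
One way to picture the alternative suggested here is a guard that lets a few event types through even when the reporting RS is no longer online. This is a minimal sketch, not the HBase code: the enum is a hypothetical stand-in for the event types named in this thread, the method and field names are made up, and which events belong in the allow-list is exactly the open question.

```java
import java.util.EnumSet;
import java.util.Set;

// Illustrative only: a relaxed version of the "known server" check.
final class TransitionGuard {

  // Hypothetical stand-in for the ZK event types mentioned in this thread.
  enum ZkEvent {
    RS_ZK_REGION_OPENING,
    RS_ZK_REGION_FAILED_OPEN,
    RS_ZK_REGION_CLOSING,
    RS_ZK_REGION_CLOSED
  }

  // Assumption: the two states named in the comment above can be processed
  // without waiting for TimeoutMonitor or ServerShutdownHandler.
  private static final Set<ZkEvent> OK_WHEN_SERVER_OFFLINE =
      EnumSet.of(ZkEvent.RS_ZK_REGION_FAILED_OPEN, ZkEvent.RS_ZK_REGION_CLOSED);

  // Returns true if the master should process the transition right away.
  static boolean shouldHandle(ZkEvent event, boolean serverOnline, boolean fromMaster) {
    if (serverOnline || fromMaster) {
      return true; // same outcome as the existing check
    }
    return OK_WHEN_SERVER_OFFLINE.contains(event);
  }

  public static void main(String[] args) {
    System.out.println(shouldHandle(ZkEvent.RS_ZK_REGION_FAILED_OPEN, false, false));
    System.out.println(shouldHandle(ZkEvent.RS_ZK_REGION_OPENING, false, false));
  }
}
```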




[jira] [Commented] (HBASE-4446) Rolling restart RSs scenario, regions could stay in OPENING state

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113825#comment-13113825 ] 

Hudson commented on HBASE-4446:
-------------------------------

Integrated in HBase-0.92 #17 (See [https://builds.apache.org/job/HBase-0.92/17/])
    HBASE-4446 Rolling restart RSs scenario, regions could stay in OPENING state

stack : 
Files : 
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java



[jira] [Updated] (HBASE-4446) Rolling restart RSs scenario, regions could stay in OPENING state

Posted by "Ming Ma (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ming Ma updated HBASE-4446:
---------------------------

    Summary: Rolling restart RSs scenario, regions could stay in OPENING state  (was: Rolling restart RSs, region could remain in OPENING state)


[jira] [Updated] (HBASE-4446) Rolling restart RS, region could stay in OPENING state

Posted by "Ming Ma (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ming Ma updated HBASE-4446:
---------------------------

    Attachment: HBASE-4446-trunk.patch

Here is the fix. TimeoutMonitor can go ahead and reassign the region if the ZK node is in the RS_ZK_REGION_FAILED_OPEN state.

Tested on a 5-machine cluster with rolling restarts of RSs for 3 hours. With the fix, no region stays in the OPENING state forever.
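
In outline, the change makes the timeout path stop treating RS_ZK_REGION_FAILED_OPEN as an unexpected state. The sketch below only illustrates that decision: the enum, the Action values, and the method name are hypothetical stand-ins, not the AssignmentManager code in the attached patch, and the handling of a node still in OPENING is an assumption based on the snippet in the report.

```java
// Illustrative only: models the timeout decision for a region stuck in OPENING.
final class OpeningTimeoutSketch {

  // Hypothetical stand-in for the ZK event types; only a few are modelled.
  enum ZkEvent { RS_ZK_REGION_OPENING, RS_ZK_REGION_FAILED_OPEN, RS_ZK_REGION_CLOSED }

  enum Action { REASSIGN, WARN_AND_SKIP }

  // fixApplied=false models the behaviour reported in this issue.
  static Action onOpeningTimeout(ZkEvent inZNode, boolean fixApplied) {
    if (inZNode == ZkEvent.RS_ZK_REGION_OPENING) {
      // Assumption: a node still in OPENING proceeds to reassignment.
      return Action.REASSIGN;
    }
    if (fixApplied && inZNode == ZkEvent.RS_ZK_REGION_FAILED_OPEN) {
      // With the fix, FAILED_OPEN is no longer treated as unexpected.
      return Action.REASSIGN;
    }
    // Everything else is logged as an unexpected state and left alone.
    return Action.WARN_AND_SKIP;
  }

  public static void main(String[] args) {
    ZkEvent e = ZkEvent.RS_ZK_REGION_FAILED_OPEN;
    System.out.println("before fix: " + onOpeningTimeout(e, false));
    System.out.println("after fix:  " + onOpeningTimeout(e, true));
  }
}
```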


[jira] [Updated] (HBASE-4446) Rolling restart RSs, region could remain in OPENING state

Posted by "Ming Ma (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ming Ma updated HBASE-4446:
---------------------------

    Summary: Rolling restart RSs, region could remain in OPENING state  (was: Rolling restart RS, region could stay in OPENING state)


[jira] [Resolved] (HBASE-4446) Rolling restart RSs scenario, regions could stay in OPENING state

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-4446.
--------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]

Committed to 0.92 branch and trunk.  Thanks for the patch, Ming.


[jira] [Commented] (HBASE-4446) Rolling restart RSs scenario, regions could stay in OPENING state

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108297#comment-13108297 ] 

Ted Yu commented on HBASE-4446:
-------------------------------

+1 on patch.


[jira] [Commented] (HBASE-4446) Rolling restart RSs scenario, regions could stay in OPENING state

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108390#comment-13108390 ] 

Todd Lipcon commented on HBASE-4446:
------------------------------------

Waiting on ServerShutdownHandler may make sense for some states, e.g. if the region was CLOSING, we need to make sure that we split logs before we reassign. But I agree that we can handle many other states (OPENING, FAILED_OPEN, CLOSED) regardless of whether the RS is online.
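
The distinction drawn here could be captured as a per-state rule. The sketch is illustrative only: the enum and method names are hypothetical, and it encodes just the four states named in this comment.

```java
// Illustrative only: which states can be handled while the reporting RS is offline.
final class OfflineServerPolicy {

  // Hypothetical short names for the states mentioned in the comment above.
  enum RegionState { OPENING, FAILED_OPEN, CLOSED, CLOSING }

  // True if reassignment has to wait for ServerShutdownHandler when the RS
  // that reported the state is no longer online.
  static boolean mustWaitForShutdownHandler(RegionState state) {
    switch (state) {
      case CLOSING:
        return true; // logs must be split before the region is reassigned
      case OPENING:
      case FAILED_OPEN:
      case CLOSED:
        return false; // can be handled whether or not the RS is online
      default:
        throw new IllegalArgumentException("state not covered: " + state);
    }
  }

  public static void main(String[] args) {
    for (RegionState s : RegionState.values()) {
      System.out.println(s + " waits for shutdown handler: " + mustWaitForShutdownHandler(s));
    }
  }
}
```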


[jira] [Commented] (HBASE-4446) Rolling restart RSs scenario, regions could stay in OPENING state

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113545#comment-13113545 ] 

stack commented on HBASE-4446:
------------------------------

@Ming, do you want to open a new issue to cover your point above: "Why does AssignmentManager.handleRegion have to enforce the following condition and rely on TimeoutMonitor and ServerShutdownHandler to kick in?"


[jira] [Updated] (HBASE-4446) Rolling restart RSs scenario, regions could stay in OPENING state

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-4446:
--------------------------

    Fix Version/s: 0.92.0


[jira] [Commented] (HBASE-4446) Rolling restart RSs scenario, regions could stay in OPENING state

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108396#comment-13108396 ] 

ramkrishna.s.vasudevan commented on HBASE-4446:
-----------------------------------------------

+1. Nice analysis.
We need to dig in more to find out whether any other corner scenarios like this come up.


[jira] [Commented] (HBASE-4446) Rolling restart RSs scenario, regions could stay in OPENING state

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113673#comment-13113673 ] 

Hudson commented on HBASE-4446:
-------------------------------

Integrated in HBase-TRUNK #2245 (See [https://builds.apache.org/job/HBase-TRUNK/2245/])
    HBASE-4446 Rolling restart RSs scenario, regions could stay in OPENING state

stack : 
Files : 
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java



[jira] [Commented] (HBASE-4446) Rolling restart RSs scenario, regions could stay in OPENING state

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108272#comment-13108272 ] 

Todd Lipcon commented on HBASE-4446:
------------------------------------

Hi Ming. This seems like a good fix to TimeoutMonitor. I wonder, though - why didn't the master see the initial transition to FAILED_OPEN and handle it at that time by re-assigning?
