Posted to issues@hbase.apache.org by "Jonathan Gray (JIRA)" <ji...@apache.org> on 2010/09/30 20:45:33 UTC

[jira] Created: (HBASE-3057) Race condition when closing regions that causes flakiness in TestRestartCluster

Race condition when closing regions that causes flakiness in TestRestartCluster
-------------------------------------------------------------------------------

                 Key: HBASE-3057
                 URL: https://issues.apache.org/jira/browse/HBASE-3057
             Project: HBase
          Issue Type: Bug
    Affects Versions: 0.90.0
            Reporter: Jonathan Gray
            Assignee: Jonathan Gray
             Fix For: 0.90.0


In {{TestRestartCluster.testClusterRestart()}} we spin up a cluster, create three tables, shut it down, start it back up, and ensure we still have three regions.

A subtle race condition during the first shutdown prevents the flush of META from finishing, so when we start back up there are no user regions.

I'm not sure if there are reasons for the current ordering, but this is the section of code in CloseRegionHandler around line 118:

{noformat}
      this.rsServices.removeFromOnlineRegions(regionInfo.getEncodedName());
      region.close(abort);
{noformat}

We remove the region from the online region map before actually closing it.  But the main run() loop in the RS decides it can shut down once the online region map is empty, so it can exit while the close (and its flush) is still in progress.

{noformat}
  private void waitOnAllRegionsToClose() {
    // Wait till all regions are closed before going out.
    int lastCount = -1;
    while (!this.onlineRegions.isEmpty()) {
{noformat}

Any reason not to swap these two and do the close before removing from online regions?
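
To make the window concrete, here is a minimal single-threaded sketch of the two orderings. It is not the HBase code: {{FakeRegion}}, {{windowExists}} and the plain map are hypothetical stand-ins, and the check between the two steps plays the part of the shutdown loop polling {{onlineRegions.isEmpty()}}.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; FakeRegion and windowExists are illustrative names,
// not HBase classes or methods.
class CloseOrderingSketch {

  /** Stand-in for a region whose close() would flush its data. */
  static class FakeRegion {
    final String encodedName;
    volatile boolean closed = false;

    FakeRegion(String encodedName) {
      this.encodedName = encodedName;
    }

    void close() {
      // The real close would flush here; the sketch only records completion.
      closed = true;
    }
  }

  /**
   * Returns true if, between the two steps, the online map is empty while the
   * region is still unclosed. That is the state a shutdown waiter polling
   * isEmpty() would misread as "all regions closed".
   */
  static boolean windowExists(boolean removeBeforeClose) {
    Map<String, FakeRegion> onlineRegions = new ConcurrentHashMap<>();
    FakeRegion region = new FakeRegion("meta");
    onlineRegions.put(region.encodedName, region);

    boolean window;
    if (removeBeforeClose) {
      // Ordering quoted above: remove from the map, then close.
      onlineRegions.remove(region.encodedName);
      window = onlineRegions.isEmpty() && !region.closed;
      region.close();
    } else {
      // Proposed ordering: close, then remove from the map.
      region.close();
      window = onlineRegions.isEmpty() && !region.closed;
      onlineRegions.remove(region.encodedName);
    }
    return window;
  }

  public static void main(String[] args) {
    System.out.println("remove-then-close window: " + windowExists(true));
    System.out.println("close-then-remove window: " + windowExists(false));
  }
}
```

In this sketch only the remove-then-close ordering reports a window. In the real server the window is only hit when the shutdown thread happens to check the map at that moment, which would be consistent with the test failing intermittently.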

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HBASE-3057) Race condition when closing regions that causes flakiness in TestRestartCluster

Posted by "Lars Francke (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Francke updated HBASE-3057:
--------------------------------

    Attachment: lars-stacktrace.2.txt
                lars-stacktrace.1.txt

I'm not sure if this is the same problem. If not I'll open another issue.

testClusterRestart fails for me most of the time. I've attached two logs. The first is from unaltered trunk; the second is from a slightly modified trunk (so line numbers etc. won't match up) and shows a slightly different error.



[jira] Resolved: (HBASE-3057) Race condition when closing regions that causes flakiness in TestRestartCluster

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray resolved HBASE-3057.
----------------------------------

    Hadoop Flags: [Reviewed]
      Resolution: Fixed

Committed to trunk



[jira] Updated: (HBASE-3057) Race condition when closing regions that causes flakiness in TestRestartCluster

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray updated HBASE-3057:
---------------------------------

    Attachment: HBASE-3057-v1.patch



[jira] Commented: (HBASE-3057) Race condition when closing regions that causes flakiness in TestRestartCluster

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916603#action_12916603 ] 

stack commented on HBASE-3057:
------------------------------

+1  Looks like a mistake on my part (I love unit tests).
