Posted to issues@hbase.apache.org by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org> on 2012/06/01 15:12:37 UTC

[jira] [Created] (HBASE-6147) SSH and AM.joinCluster leads to region assignment inconsistency in many cases.

ramkrishna.s.vasudevan created HBASE-6147:
---------------------------------------------

             Summary: SSH and AM.joinCluster leads to region assignment inconsistency in many cases.
                 Key: HBASE-6147
                 URL: https://issues.apache.org/jira/browse/HBASE-6147
             Project: HBase
          Issue Type: Bug
    Affects Versions: 0.94.0, 0.92.1
            Reporter: ramkrishna.s.vasudevan
             Fix For: 0.92.2, 0.96.0, 0.94.1


We are facing a few issues when the master restart and SSH run in parallel.
Chunhui also suggested that we need to rework this part.  This JIRA is aimed at solving all such possibilities of region assignment inconsistency.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-6147) SSH and AM.joinCluster leads to region assignment inconsistency in many cases.

Posted by "chunhui shen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288326#comment-13288326 ] 

chunhui shen commented on HBASE-6147:
-------------------------------------

@ram
We have already found and fixed many cases for SSH and AM.joinCluster; however, it seems many other cases still exist.

First, here is a suggestion I mentioned in another issue:

Don't assign user regions in SSH until the master is initialized, for example by doing the following:
{code}
process(){
...

if (isCarryingRoot() || isCarryingMeta()){...}

...

    int waitedTimeForMasterInitialized = 0;
    while (!server.isStopped() && !services.isInitialized()) {
      try {
        if (waitedTimeForMasterInitialized == 0) {
          LOG.info("Master is not initialized, waiting...");
        }
        Thread.sleep(100);
        waitedTimeForMasterInitialized += 100;
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IOException("Interrupted", e);
      }
    }
    if (waitedTimeForMasterInitialized > 0) {
      LOG.info("Recovery time calculation: waiting on master to be initialized took "
          + waitedTimeForMasterInitialized + "ms");
    }
...
}
{code}

In some cases the code above will increase recovery time, but if it fixes many of the cases caused by SSH and AM.joinCluster, I think it is worthwhile.

Correct me if I'm wrong, thanks.
                
> SSH and AM.joinCluster leads to region assignment inconsistency in many cases.
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-6147
>                 URL: https://issues.apache.org/jira/browse/HBASE-6147
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.1, 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>             Fix For: 0.92.3
>
>
> We are facing few issues in the master restart and SSH going in parallel.
> Chunhui also suggested that we need to rework on this part.  This JIRA is aimed at solving all such possibilities of region assignment inconsistency


        

[jira] [Commented] (HBASE-6147) SSH and AM.joinCluster leads to region assignment inconsistency in many cases.

Posted by "Zhihong Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13289455#comment-13289455 ] 

Zhihong Ted Yu commented on HBASE-6147:
---------------------------------------

Nice start.
{code}
+          Thread.sleep(100);
+          waitedTimeForMasterInitialized += 100;
{code}
We don't know how long the sleep() call may actually have taken. Better to measure the elapsed time ourselves.
{code}
+          Thread.currentThread().interrupt();
+          throw new IOException("Interrupted", e);
{code}
An InterruptedIOException should be created above instead of a plain IOException.
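
The two review points above can be sketched roughly as follows. This is only an illustration and not the code from the attached patch: the class name MasterInitWait, the method waitUntilInitialized and the BooleanSupplier parameters are hypothetical stand-ins for server.isStopped() and services.isInitialized(). The waited time is measured with System.nanoTime() rather than summed from the requested sleep interval, and the interrupt is reported as an InterruptedIOException.

```java
import java.io.InterruptedIOException;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of HBase: waits for "initialized" while
// measuring the real elapsed time instead of counting sleep intervals.
final class MasterInitWait {

  /**
   * Polls until either supplier returns true.
   * Returns the measured wait in milliseconds.
   */
  static long waitUntilInitialized(BooleanSupplier stopped,
      BooleanSupplier initialized, long pollMs) throws InterruptedIOException {
    long start = System.nanoTime();
    while (!stopped.getAsBoolean() && !initialized.getAsBoolean()) {
      try {
        Thread.sleep(pollMs);
      } catch (InterruptedException e) {
        // Restore the interrupt flag, then surface the interrupt as an
        // InterruptedIOException so callers can tell it from other IO errors.
        Thread.currentThread().interrupt();
        InterruptedIOException iioe = new InterruptedIOException(
            "Interrupted while waiting for master initialization");
        iioe.initCause(e);
        throw iioe;
      }
    }
    // Elapsed wall time, however long each sleep() really took.
    return (System.nanoTime() - start) / 1_000_000L;
  }

  public static void main(String[] args) throws InterruptedIOException {
    long deadline = System.nanoTime() + 50_000_000L; // pretend init takes ~50ms
    long waited = waitUntilInitialized(
        () -> false, () -> System.nanoTime() >= deadline, 10L);
    System.out.println("Waiting on master to be initialized took about "
        + waited + "ms");
  }
}
```

In SSH the two suppliers would presumably wrap the existing server.isStopped() and services.isInitialized() checks.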
                
> SSH and AM.joinCluster leads to region assignment inconsistency in many cases.
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-6147
>                 URL: https://issues.apache.org/jira/browse/HBASE-6147
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.1, 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>             Fix For: 0.92.3
>
>         Attachments: HBASE-6147.patch
>
>
> We are facing few issues in the master restart and SSH going in parallel.
> Chunhui also suggested that we need to rework on this part.  This JIRA is aimed at solving all such possibilities of region assignment inconsistency


        

[jira] [Commented] (HBASE-6147) SSH and AM.joinCluster leads to region assignment inconsistency in many cases.

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288343#comment-13288343 ] 

ramkrishna.s.vasudevan commented on HBASE-6147:
-----------------------------------------------

@Chunhui
Your suggestion definitely needs to be done as well.  All related changes w.r.t. this scenario can be addressed here. Good on you, Chunhui.
                
> SSH and AM.joinCluster leads to region assignment inconsistency in many cases.
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-6147
>                 URL: https://issues.apache.org/jira/browse/HBASE-6147
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.1, 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>             Fix For: 0.92.3
>
>
> We are facing few issues in the master restart and SSH going in parallel.
> Chunhui also suggested that we need to rework on this part.  This JIRA is aimed at solving all such possibilities of region assignment inconsistency


        

[jira] [Commented] (HBASE-6147) SSH and AM.joinCluster leads to region assignment inconsistency in many cases.

Posted by "Zhihong Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291139#comment-13291139 ] 

Zhihong Ted Yu commented on HBASE-6147:
---------------------------------------

During the testing phase, an option could be introduced to enable the following call:
{code}
+      waitTillMasterInitialized();
{code}
so that we can compare the performance difference.
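
A rough sketch of what such an option could look like is below. The property name and the SshWaitOption class are made up for illustration and are not actual HBase configuration; java.util.Properties stands in for the real Configuration object so that the example is self-contained.

```java
import java.util.Properties;

// Hypothetical toggle, not an existing HBase setting.
final class SshWaitOption {

  // Assumed key name, chosen only for this example.
  static final String WAIT_KEY = "hbase.master.ssh.wait.on.master.init";

  /** The wait is on by default and can be switched off for comparison runs. */
  static boolean shouldWait(Properties conf) {
    return Boolean.parseBoolean(conf.getProperty(WAIT_KEY, "true"));
  }

  /** Runs the supplied wait step only when the option is enabled. */
  static String process(Properties conf, Runnable waitTillMasterInitialized) {
    if (shouldWait(conf)) {
      waitTillMasterInitialized.run();
      return "waited";
    }
    return "skipped";
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty(WAIT_KEY, "false");
    System.out.println(process(conf, () -> System.out.println("waiting...")));
  }
}
```

Running the same recovery test once with the option on and once with it off would give the two recovery times to compare.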
                
> SSH and AM.joinCluster leads to region assignment inconsistency in many cases.
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-6147
>                 URL: https://issues.apache.org/jira/browse/HBASE-6147
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.1, 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>             Fix For: 0.92.3
>
>         Attachments: HBASE-6147.patch, HBASE-6147_trunk.patch
>
>
> We are facing few issues in the master restart and SSH going in parallel.
> Chunhui also suggested that we need to rework on this part.  This JIRA is aimed at solving all such possibilities of region assignment inconsistency


        

[jira] [Commented] (HBASE-6147) SSH and AM.joinCluster leads to region assignment inconsistency in many cases.

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291073#comment-13291073 ] 

ramkrishna.s.vasudevan commented on HBASE-6147:
-----------------------------------------------

This patch does not update the comments; it is only meant to show the changes we need to make to solve this problem.  For this to work, HBASE-6060 should go in, and for trunk HBASE-6012 should go in as well.
How HBASE-6060 helps, and how Chunhui's suggestion of waiting for master initialization helps, is explained below.
-> If any RS goes down while assignments are in progress, HBASE-6060 will handle it.
-> Taking the case of join cluster and SSH, the following scenarios are to be considered:
1> Clean cluster start up
2> Partially clean start up

In the case of a clean cluster start up, we do a bulk assign.  If any RS goes down while this is happening, then as per Chunhui's suggestion SSH will wait for the master to initialize.
By that time bulk assign will have populated the region plan, including entries for the dead server.  So when the master completes initialization, SSH will see that a few regions in the region plan point to the dead server, and the new logic introduced in HBASE-6060 will go ahead with the assignment.  No further waiting is needed.

For the 2nd case, the server may already be dead by the time ProcessRIT decides to process the node.  Previously, in
{code}
      addToRITandCallClose(regionInfo, RegionState.State.OFFLINE, rt);
          break;
        }

        regionsInTransition.put(encodedRegionName,
          getRegionState(regionInfo, RegionState.State.OPENING, rt));
        failoverProcessedRegions.put(encodedRegionName, regionInfo);
{code}
we were just putting the region into the RIT map in the OPENING state, but there would be no one to process it.  With the latest patch we just add a region plan instead.
Now even if the server goes down and SSH tries to process it, SSH will see the region plan (with HBASE-6060 and Chunhui's suggestion) and immediately trigger assignment.
We found that this may be needed for 'RS_ZK_REGION_OPENED' as well.
We will also do cluster testing.
Please review and provide your comments.  Hopefully with these changes we will not need to depend on the timeout monitor.
@Chunhui
Please provide your thoughts on this.  It would be nice if you could also test the patches for HBASE-6147, HBASE-6060 and HBASE-6012 together.
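
To make the region plan idea concrete, here is a minimal sketch of the check SSH would do once the master is initialized. It is not the HBASE-6060 code: RegionPlanSketch and regionsToReassign are hypothetical names, and a plain map of region name to destination server stands in for the region plans kept by the AssignmentManager.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model: region name -> destination server chosen by the plan.
final class RegionPlanSketch {

  /** Returns the regions whose plan still points at the dead server. */
  static List<String> regionsToReassign(Map<String, String> regionPlans,
      String deadServer) {
    List<String> result = new ArrayList<>();
    for (Map.Entry<String, String> plan : regionPlans.entrySet()) {
      if (deadServer.equals(plan.getValue())) {
        result.add(plan.getKey());
      }
    }
    Collections.sort(result); // stable order, only to keep the output readable
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> plans = new HashMap<>();
    plans.put("region-a", "rs1");
    plans.put("region-b", "rs2");
    plans.put("region-c", "rs1");
    // rs1 died during startup, so its planned regions are reassigned at once.
    System.out.println(regionsToReassign(plans, "rs1"));
  }
}
```

The point of the sketch is only that a plan entry naming the dead server is enough for SSH to act immediately, without falling back on the timeout monitor.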





                
> SSH and AM.joinCluster leads to region assignment inconsistency in many cases.
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-6147
>                 URL: https://issues.apache.org/jira/browse/HBASE-6147
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.1, 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>             Fix For: 0.92.3
>
>         Attachments: HBASE-6147.patch, HBASE-6147_trunk.patch
>
>
> We are facing few issues in the master restart and SSH going in parallel.
> Chunhui also suggested that we need to rework on this part.  This JIRA is aimed at solving all such possibilities of region assignment inconsistency


        

[jira] [Commented] (HBASE-6147) SSH and AM.joinCluster leads to region assignment inconsistency in many cases.

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13289171#comment-13289171 ] 

stack commented on HBASE-6147:
------------------------------

Haven't looked at patch yet but this seems like good direction lads.


 





                
> SSH and AM.joinCluster leads to region assignment inconsistency in many cases.
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-6147
>                 URL: https://issues.apache.org/jira/browse/HBASE-6147
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.1, 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>             Fix For: 0.92.3
>
>         Attachments: HBASE-6147.patch
>
>
> We are facing few issues in the master restart and SSH going in parallel.
> Chunhui also suggested that we need to rework on this part.  This JIRA is aimed at solving all such possibilities of region assignment inconsistency


        

[jira] [Commented] (HBASE-6147) SSH and AM.joinCluster leads to region assignment inconsistency in many cases.

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287410#comment-13287410 ] 

ramkrishna.s.vasudevan commented on HBASE-6147:
-----------------------------------------------

We hit the following case:
-> Initially we had 2 RS and 1 master with a few regions.
-> We stopped the cluster and restarted the master and the 2 RS.
-> The znode of one of the RS had not yet been deleted, but the master started coming up.
-> The master now sees a server that is dead but not yet expired, so it calls expireServer, which in turn calls SSH.
-> After this the master treats this as a clean cluster startup.
-> SSH now triggers an assignment while master startup starts bulk assignment.
-> Because the znode is already present, the bulk assignment makes the master go down.
So we need to handle such cases.  Solving this should help us solve most of the double assignment cases.  There can be more such scenarios.
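
The last two steps can be modelled in a few lines. This is a toy model and not HBase or ZooKeeper code: DoubleAssignSketch is a hypothetical class, and a ConcurrentHashMap with putIfAbsent stands in for creating a znode, where the second creator finds the node already there.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Toy model of two assigners racing to create the same node for a region.
final class DoubleAssignSketch {

  // region name -> whoever created its node first
  private final ConcurrentMap<String, String> znodes = new ConcurrentHashMap<>();

  /** Returns true only for the caller that actually created the node. */
  boolean createNode(String region, String owner) {
    return znodes.putIfAbsent(region, owner) == null;
  }

  String ownerOf(String region) {
    return znodes.get(region);
  }

  public static void main(String[] args) {
    DoubleAssignSketch zk = new DoubleAssignSketch();
    boolean sshCreated = zk.createNode("region-a", "SSH");
    boolean bulkCreated = zk.createNode("region-a", "bulk-assign");
    // The second creator fails; in the case described above that failure
    // is what brings the master down.
    System.out.println("SSH created: " + sshCreated
        + ", bulk assign created: " + bulkCreated);
  }
}
```

Whichever path runs second sees the existing node, so either the two paths must not run in parallel or the second one must treat the existing node as non-fatal.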
                
> SSH and AM.joinCluster leads to region assignment inconsistency in many cases.
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-6147
>                 URL: https://issues.apache.org/jira/browse/HBASE-6147
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.1, 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.96.0, 0.94.1
>
>
> We are facing few issues in the master restart and SSH going in parallel.
> Chunhui also suggested that we need to rework on this part.  This JIRA is aimed at solving all such possibilities of region assignment inconsistency


        

[jira] [Updated] (HBASE-6147) SSH and AM.joinCluster leads to region assignment inconsistency in many cases.

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-6147:
------------------------------------------

    Attachment: HBASE-6147_trunk.patch
    
> SSH and AM.joinCluster leads to region assignment inconsistency in many cases.
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-6147
>                 URL: https://issues.apache.org/jira/browse/HBASE-6147
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.1, 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>             Fix For: 0.92.3
>
>         Attachments: HBASE-6147.patch, HBASE-6147_trunk.patch
>
>
> We are facing few issues in the master restart and SSH going in parallel.
> Chunhui also suggested that we need to rework on this part.  This JIRA is aimed at solving all such possibilities of region assignment inconsistency


        

[jira] [Updated] (HBASE-6147) SSH and AM.joinCluster leads to region assignment inconsistency in many cases.

Posted by "chunhui shen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chunhui shen updated HBASE-6147:
--------------------------------

    Attachment: HBASE-6147.patch

Making a patch as a beginning
                
> SSH and AM.joinCluster leads to region assignment inconsistency in many cases.
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-6147
>                 URL: https://issues.apache.org/jira/browse/HBASE-6147
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.1, 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>             Fix For: 0.92.3
>
>         Attachments: HBASE-6147.patch
>
>
> We are facing few issues in the master restart and SSH going in parallel.
> Chunhui also suggested that we need to rework on this part.  This JIRA is aimed at solving all such possibilities of region assignment inconsistency


        

[jira] [Commented] (HBASE-6147) SSH and AM.joinCluster leads to region assignment inconsistency in many cases.

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13289450#comment-13289450 ] 

ramkrishna.s.vasudevan commented on HBASE-6147:
-----------------------------------------------

@Chunhui
Please take a look at HBASE-6060.  With the patch that you have given and HBASE-6060, the problem that I mentioned in
https://issues.apache.org/jira/browse/HBASE-5916?focusedCommentId=13283349&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13283349 
will not happen in 0.94.  But for trunk, since it goes with bulk assign, we need to check.  Good start anyway.

                
> SSH and AM.joinCluster leads to region assignment inconsistency in many cases.
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-6147
>                 URL: https://issues.apache.org/jira/browse/HBASE-6147
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.1, 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>             Fix For: 0.92.3
>
>         Attachments: HBASE-6147.patch
>
>
> We are facing few issues in the master restart and SSH going in parallel.
> Chunhui also suggested that we need to rework on this part.  This JIRA is aimed at solving all such possibilities of region assignment inconsistency
