
Posted to issues@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2011/06/16 11:16:01 UTC

[jira] [Created] (HBASE-3995) HBASE-3946 broke TestMasterFailover

HBASE-3946 broke TestMasterFailover
-----------------------------------

                 Key: HBASE-3995
                 URL: https://issues.apache.org/jira/browse/HBASE-3995
             Project: HBase
          Issue Type: Bug
            Reporter: stack
            Assignee: stack
            Priority: Blocker
             Fix For: 0.92.0


TestMasterFailover is all about a new master coming up on an existing cluster.  Prior to HBASE-3946, a new master joining a cluster and processing any dead servers would assign all regions found on the dead server, even if they were split parents.  We don't want that.

But TestMasterFailover mocks up some pretty interesting conditions.  The one we were failing on was that while the master was offline, we'd manually add a region to zk that was in CLOSING state.  We'd then go and disable the table up in zk (while master was offline).  Finally, we'd kill the server that was supposed to be hosting the region from the disabled table in CLOSING state.  Then we'd have the master join the cluster.  It had to figure it out.

Before HBASE-3946, we'd just force offline every region that had been on the dead server.  This would cause all of them to be assigned; only, on assign, regions from disabled tables are skipped, so it all "worked" (except that it would online the parent of a split, should there be one).
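
As a rough illustration of the behaviour described above (not the actual HBase code), here is a minimal sketch in Java. All class, field, and method names are hypothetical. It assumes a region can be reduced to a name, a table name, and a split-parent flag. It shows why the old path "worked" for disabled tables yet still onlined a split parent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: class, field, and method names are illustrative only
// and are not taken from the HBase source.
class DeadServerAssignSketch {

  /** Minimal stand-in for a region that had been hosted on the dead server. */
  static final class Region {
    final String name;
    final String table;
    final boolean splitParent;

    Region(String name, String table, boolean splitParent) {
      this.name = name;
      this.table = table;
      this.splitParent = splitParent;
    }
  }

  /**
   * Old behaviour as described in the issue: every region from the dead
   * server goes to assign, and assign skips regions of disabled tables.
   * Split parents are not filtered, so they are onlined again.
   */
  static List<String> assignAll(List<Region> onDeadServer, Set<String> disabledTables) {
    List<String> assigned = new ArrayList<>();
    for (Region r : onDeadServer) {
      if (disabledTables.contains(r.table)) {
        continue; // regions from disabled tables are skipped on assign
      }
      assigned.add(r.name);
    }
    return assigned;
  }

  /** Same loop, but also leaving split parents offline. */
  static List<String> assignSkippingSplitParents(List<Region> onDeadServer,
      Set<String> disabledTables) {
    List<String> assigned = new ArrayList<>();
    for (Region r : onDeadServer) {
      if (disabledTables.contains(r.table) || r.splitParent) {
        continue; // skip disabled-table regions and split parents
      }
      assigned.add(r.name);
    }
    return assigned;
  }

  public static void main(String[] args) {
    List<Region> regions = Arrays.asList(
        new Region("r1", "enabledTable", false),
        new Region("parent", "enabledTable", true),
        new Region("r2", "disabledTable", false));
    Set<String> disabled = new HashSet<>(Arrays.asList("disabledTable"));
    System.out.println("old path assigns: " + assignAll(regions, disabled));
    System.out.println("skipping split parents: "
        + assignSkippingSplitParents(regions, disabled));
  }
}
```

The sketch only models which regions end up assigned; it says nothing about the zk state handling that the test actually exercises.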



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-3995) HBASE-3946 broke TestMasterFailover

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056279#comment-13056279 ] 

stack commented on HBASE-3995:
------------------------------

Thank you Gao.  I removed the redundant check.



[jira] [Updated] (HBASE-3995) HBASE-3946 broke TestMasterFailover

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-3995:
-------------------------

    Attachment: am.txt

Patch that passes the list of dead servers down to the location where we process what's up in zookeeper at the time a new master joins a cluster.  The dead servers can be used to figure out whether a RIT came from a dead server; if so, we know there is no point in waiting on a CLOSING to complete or, for a disabled table, that an OPENED should go back and try to close the region that just OPENED on a server that just died.
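
A minimal sketch of the dead-server check described above, in Java. This is not the patch in am.txt: the class, enum, and method names are hypothetical, and it assumes servers can be identified by a plain string name. It covers only the CLOSING case, since that is the one the failing test hit.

```java
import java.util.Set;

// Hypothetical sketch: names here are illustrative and do not come from
// the attached patch or from the HBase source.
class RitDeadServerSketch {

  /** The two region-in-transition states mentioned in the comment above. */
  enum RitState { CLOSING, OPENED }

  /** True if the region in transition was last seen on a server now known to be dead. */
  static boolean fromDeadServer(String ritServer, Set<String> deadServers) {
    return ritServer != null && deadServers.contains(ritServer);
  }

  /**
   * A CLOSING can only complete if the hosting server is alive to finish it,
   * so there is no point waiting once that server is in the dead list.
   */
  static boolean shouldWaitForClosing(RitState state, String ritServer,
      Set<String> deadServers) {
    return state == RitState.CLOSING && !fromDeadServer(ritServer, deadServers);
  }
}
```

With the dead-server set in hand, the joining master can tell a CLOSING that will still finish apart from one it has to deal with itself.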



[jira] [Commented] (HBASE-3995) HBASE-3946 broke TestMasterFailover

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052752#comment-13052752 ] 

stack commented on HBASE-3995:
------------------------------

This looks to have fixed the issue in TRUNK, going by the fact that this test does not fail in TRUNK anymore.  Let me backport.



[jira] [Commented] (HBASE-3995) HBASE-3946 broke TestMasterFailover

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050315#comment-13050315 ] 

stack commented on HBASE-3995:
------------------------------

Committed.  Can then see in the morning if it actually fixes the build.



[jira] [Commented] (HBASE-3995) HBASE-3946 broke TestMasterFailover

Posted by "gaojinchao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055406#comment-13055406 ] 

gaojinchao commented on HBASE-3995:
-----------------------------------

Hi, stack.
In the following code snippet, the check
if (storedInfo == null)
is repeated:

    if (storedInfo == null) {
      .......
      if (storedInfo == null) {
        storedInfo = this.onlineServers.get(info.getServerName());
      }
    }



[jira] [Commented] (HBASE-3995) HBASE-3946 broke TestMasterFailover

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050660#comment-13050660 ] 

stack commented on HBASE-3995:
------------------------------

Jenkins is down for me at least.



[jira] [Updated] (HBASE-3995) HBASE-3946 broke TestMasterFailover

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-3995:
-------------------------

       Resolution: Fixed
    Fix Version/s:     (was: 0.92.0)
                   0.90.4
           Status: Resolved  (was: Patch Available)

Committed branch and trunk.



[jira] [Updated] (HBASE-3995) HBASE-3946 broke TestMasterFailover

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-3995:
-------------------------

    Status: Patch Available  (was: Open)

I'd like to just commit this and let jenkins figure out if we've fixed this build failure (will need to do a version for 0.90 too).



[jira] [Commented] (HBASE-3995) HBASE-3946 broke TestMasterFailover

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056853#comment-13056853 ] 

Hudson commented on HBASE-3995:
-------------------------------

Integrated in HBase-TRUNK #1995 (See [https://builds.apache.org/job/HBase-TRUNK/1995/])
    

