Posted to issues@hbase.apache.org by "gaojinchao (Created) (JIRA)" <ji...@apache.org> on 2011/11/04 13:47:00 UTC

[jira] [Created] (HBASE-4749) TestMasterFailover case occasional fails

TestMasterFailover case occasional fails
----------------------------------------

                 Key: HBASE-4749
                 URL: https://issues.apache.org/jira/browse/HBASE-4749
             Project: HBase
          Issue Type: Bug
          Components: test
    Affects Versions: 0.92.0
            Reporter: gaojinchao
            Priority: Minor
             Fix For: 0.92.0


See the logs here:
https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.92/105/testReport/org.apache.hadoop.hbase.master/TestMasterFailover/testMasterFailoverWithMockedRITOnDeadRS/



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4749) TestMasterFailover case occasional fails

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144107#comment-13144107 ] 

stack commented on HBASE-4749:
------------------------------

Nice work Jinchao.  I think you've pinpointed the root issue.

Having the master register regionservers that have not yet heartbeated to it, but that still have an ephemeral node up in zk, looks dangerous.  I added this in hbase-1502, where I purged master/regionserver control via heartbeats.

I think we need to remove this bit of code.  Looking....
                


        

[jira] [Updated] (HBASE-4749) TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails

Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-4749:
-------------------------

    Status: Patch Available  (was: Open)
    


        

[jira] [Commented] (HBASE-4749) TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144860#comment-13144860 ] 

Hudson commented on HBASE-4749:
-------------------------------

Integrated in HBase-0.92 #114 (See [https://builds.apache.org/job/HBase-0.92/114/])
    HBASE-4749 TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails

stack : 
Files : 
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java

                
> TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-4749
>                 URL: https://issues.apache.org/jira/browse/HBASE-4749
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Assignee: stack
>            Priority: Critical
>             Fix For: 0.92.0
>
>         Attachments: 4749-v2.txt, 4749.txt
>
>
> look this logs:
> https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.92/105/testReport/org.apache.hadoop.hbase.master/TestMasterFailover/testMasterFailoverWithMockedRITOnDeadRS/


        

[jira] [Updated] (HBASE-4749) TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails

Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-4749:
-------------------------

    Status: Patch Available  (was: Open)
    


        

[jira] [Commented] (HBASE-4749) TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144603#comment-13144603 ] 

Hadoop QA commented on HBASE-4749:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12502569/4749-v2.txt
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    -1 javadoc.  The javadoc tool appears to have generated -164 warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 48 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

     -1 core tests.  The patch failed these unit tests:
                       org.apache.hadoop.hbase.thrift2.TestThriftHBaseServiceHandler
                  org.apache.hadoop.hbase.client.TestAdmin
                  org.apache.hadoop.hbase.master.TestDistributedLogSplitting
                  org.apache.hadoop.hbase.master.TestMasterFailover
                  org.apache.hadoop.hbase.coprocessor.TestRegionServerCoprocessorExceptionWithRemove

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/183//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/183//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/183//console

This message is automatically generated.
                


        

[jira] [Commented] (HBASE-4749) TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144829#comment-13144829 ] 

Hudson commented on HBASE-4749:
-------------------------------

Integrated in HBase-TRUNK #2414 (See [https://builds.apache.org/job/HBase-TRUNK/2414/])
    HBASE-4749 TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails

stack : 
Files : 
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java

                


        

[jira] [Commented] (HBASE-4749) TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails

Posted by "Ted Yu (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144512#comment-13144512 ] 

Ted Yu commented on HBASE-4749:
-------------------------------

{code}
java.lang.NullPointerException
	at org.apache.hadoop.hbase.master.HMaster.stop(HMaster.java:1322)
	at org.apache.hadoop.hbase.master.HMaster.stopMaster(HMaster.java:1314)
{code}
Looks like this.activeMasterManager was null:
{code}
    // If we are a backup master, we need to interrupt wait
    synchronized (this.activeMasterManager.clusterHasActiveMaster) {
{code}
because becomeActiveMaster() hadn't been called yet?
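
A minimal sketch of the kind of guard this points at, assuming stop() can run before the master has set up its active-master manager. The class names below are hypothetical stand-ins, not the real HMaster or ActiveMasterManager, and the notifyAll() wake-up is an assumption about what the synchronized block is for:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the manager object; not the real HBase class.
final class ActiveMasterManagerSketch {
    final AtomicBoolean clusterHasActiveMaster = new AtomicBoolean(false);
}

// Hypothetical stand-in for the master; only the stop() path is modelled.
final class MasterStopSketch {
    // Stays null until the master starts competing to become active.
    private volatile ActiveMasterManagerSketch activeMasterManager;
    private volatile boolean stopped;

    void becomeActiveMaster() {
        this.activeMasterManager = new ActiveMasterManagerSketch();
    }

    void stop(String why) {
        this.stopped = true;
        // Read the field once so the null check and the use see the same value.
        ActiveMasterManagerSketch manager = this.activeMasterManager;
        if (manager == null) {
            return; // Nobody can be waiting yet, so there is nothing to wake up.
        }
        // If we are a backup master, we need to interrupt wait
        synchronized (manager.clusterHasActiveMaster) {
            manager.clusterHasActiveMaster.notifyAll();
        }
    }

    boolean isStopped() {
        return stopped;
    }
}
```

Calling stop() on a freshly constructed MasterStopSketch returns normally instead of throwing, which is the behaviour the stack trace above suggests was missing.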
                


        

[jira] [Commented] (HBASE-4749) TestMasterFailover case occasional fails

Posted by "Ted Yu (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144182#comment-13144182 ] 

Ted Yu commented on HBASE-4749:
-------------------------------

Fourth option: back ServerManager.onlineServers with an expiring ConcurrentMap built by Guava's MapMaker, where the expiration period can be adjusted by a new config parameter:
{code}
ConcurrentMap<ServerName, HServerLoad> onlineServers =
  new MapMaker().expiration(2, TimeUnit.MINUTES).evictionListener(listener).makeMap();
{code}
The registered listener would call expireServer() for the underlying region server.
In HMaster.regionServerReport(), we refresh the entry (through a call to ServerManager) for the reporting region server.

In terms of testing, this approach may add extra waiting time.
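
The same idea can be sketched with only the JDK, so it does not depend on a particular Guava release. This is an illustration under assumptions, not the ServerManager code: server names are plain strings, the caller passes in the clock, and the class, method and callback names are hypothetical. A periodic sweep() plays the role of the eviction listener.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical sketch of an online-servers map whose entries expire
// unless the region server keeps reporting in.
final class OnlineServersSketch {
    private final Map<String, Long> lastReportMillis = new ConcurrentHashMap<>();
    private final long expiryMillis;
    private final Consumer<String> expireServer;

    OnlineServersSketch(long expiryMillis, Consumer<String> expireServer) {
        this.expiryMillis = expiryMillis;
        this.expireServer = expireServer;
    }

    // Called on every report from a region server; refreshes its entry.
    void regionServerReport(String serverName, long nowMillis) {
        lastReportMillis.put(serverName, nowMillis);
    }

    // Removes servers that have been silent for at least expiryMillis and
    // hands each one to the expiry callback. Returns the expired names.
    List<String> sweep(long nowMillis) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastReportMillis.entrySet()) {
            Long seen = entry.getValue();
            // remove(key, value) only succeeds if no newer report arrived meanwhile.
            if (nowMillis - seen >= expiryMillis
                    && lastReportMillis.remove(entry.getKey(), seen)) {
                expired.add(entry.getKey());
                expireServer.accept(entry.getKey());
            }
        }
        return expired;
    }

    boolean isOnline(String serverName) {
        return lastReportMillis.containsKey(serverName);
    }
}
```

Passing the time in as a parameter keeps a test from having to sleep for the full expiration period, which speaks to the waiting concern raised above.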
                


        

[jira] [Commented] (HBASE-4749) TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails

Posted by "Ted Yu (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144234#comment-13144234 ] 

Ted Yu commented on HBASE-4749:
-------------------------------

I tried option 2.
I ran TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS in a loop 20 times and didn't get a failure.
Previously it was very easy to reproduce the failure.
                


        

[jira] [Commented] (HBASE-4749) TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails

Posted by "gaojinchao (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144526#comment-13144526 ] 

gaojinchao commented on HBASE-4749:
-----------------------------------

The logs contain "Caused by: java.io.IOException: Too many open files".

                


        

[jira] [Updated] (HBASE-4749) TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails

Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-4749:
-------------------------

    Attachment: 4749-v2.txt

Trying again.  Patch includes a check that activeMaster is not null over in HMaster (as per Ted).  I think the failure is because of too many open files (as per Gaojinchao).  Good stuff lads.
                


        

[jira] [Commented] (HBASE-4749) TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails

Posted by "Ted Yu (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144552#comment-13144552 ] 

Ted Yu commented on HBASE-4749:
-------------------------------

I should have looked at the bottom of https://builds.apache.org/job/PreCommit-HBASE-Build/181//testReport/org.apache.hadoop.hbase.master/TestMasterFailover/testMasterFailoverWithMockedRITOnDeadRS/ :
{code}
2011-11-05 01:06:29,718 ERROR [Thread-953] hbase.MiniHBaseCluster(201): Error starting cluster
java.lang.RuntimeException: Failed construction of RegionServer: class org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer
...
Caused by: java.io.IOException: Too many open files
	at sun.nio.ch.IOUtil.initPipe(Native Method)
	at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:49)
	at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
{code}
I think we should increase ulimit on asf001.sp2.ygridcore.net
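
Before raising the limit, it can help to see how close a test JVM actually gets to it. A hedged sketch, assuming a HotSpot/OpenJDK JVM on a Unix host where the platform bean implements com.sun.management.UnixOperatingSystemMXBean; the class and method names here are made up for illustration:

```java
import com.sun.management.UnixOperatingSystemMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper for logging file-descriptor usage from inside a test JVM.
final class FdHeadroomSketch {
    // Returns {open, max} descriptor counts, or null when this JVM or
    // platform does not expose them.
    static long[] descriptorUsage() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            return new long[] {
                unix.getOpenFileDescriptorCount(),
                unix.getMaxFileDescriptorCount()
            };
        }
        return null;
    }
}
```

Logging descriptorUsage() at the start and end of a mini-cluster test would show whether the test itself leaks descriptors or the per-process limit on the build host is simply too low.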
                


        

[jira] [Updated] (HBASE-4749) TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails

Posted by "Ted Yu (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-4749:
--------------------------

    Priority: Critical  (was: Minor)

This issue may require changes in master code.
                


        

[jira] [Commented] (HBASE-4749) TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails

Posted by "Ted Yu (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144263#comment-13144263 ] 

Ted Yu commented on HBASE-4749:
-------------------------------

I think testMasterFailoverWithMockedRITOnDeadRS should be forked into two tests:
1. the aborted RS carried .META.
2. the aborted RS didn't carry .META.

This would make each test behave deterministically.
                


        

[jira] [Commented] (HBASE-4749) TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144249#comment-13144249 ] 

stack commented on HBASE-4749:
------------------------------

Yeah, that would fix the test but we'd be left w/ hbase-4511 -- where on master failover, if root or meta verification fails because the hosting server is going down, we'll miss edits.  Let me see if I can fix it.
                


        

[jira] [Updated] (HBASE-4749) TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails

Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-4749:
-------------------------

    Attachment: 4749.txt

Here is a patch where the wait on the regionserver does not wait a fixed period -- it actually waits till the RS is down.  Running the test, it seems to be working.  I'll let my tests run a bit longer...
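
A rough sketch of that style of wait, not the contents of the attached patch: poll a condition until it holds or a timeout passes, instead of sleeping for a fixed period. The helper name and parameters are hypothetical.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical test helper: waits for a condition rather than for a fixed time.
final class WaitUntilSketch {
    // Returns true as soon as the condition holds, false on timeout or interrupt.
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadlineNanos >= 0) {
                return false; // Timed out; the caller decides whether to fail the test.
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // Preserve the interrupt for the caller.
                return false;
            }
        }
        return true;
    }
}
```

In a test, the condition would be something like "the region server thread is no longer alive", so the test proceeds as soon as the RS is really down and never earlier.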
                


        

[jira] [Commented] (HBASE-4749) TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144442#comment-13144442 ] 

stack commented on HBASE-4749:
------------------------------

So, looking at this more, I think it's correct to register servers that are up in zk but that have not reported in (and have not expired yet).  And HBASE-4511 is a real issue (I've commented over there).  Since hbase-4511 is a rare issue IMO, I don't think it is a blocker/critical fix needed for 0.92.

To fix this test, let's do something basic like one of Ted's suggestions above.
                


        

[jira] [Commented] (HBASE-4749) TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144509#comment-13144509 ] 

Hadoop QA commented on HBASE-4749:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12502545/4749.txt
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    -1 javadoc.  The javadoc tool appears to have generated -164 warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 48 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

     -1 core tests.  The patch failed these unit tests:
                       org.apache.hadoop.hbase.master.TestMasterFailover
                  org.apache.hadoop.hbase.master.TestDistributedLogSplitting

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/181//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/181//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/181//console

This message is automatically generated.
                
> TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-4749
>                 URL: https://issues.apache.org/jira/browse/HBASE-4749
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Priority: Critical
>             Fix For: 0.92.0
>
>         Attachments: 4749.txt
>
>
> look this logs:
> https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.92/105/testReport/org.apache.hadoop.hbase.master/TestMasterFailover/testMasterFailoverWithMockedRITOnDeadRS/


        

[jira] [Commented] (HBASE-4749) TestMasterFailover case occasional fails

Posted by "gaojinchao (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13143990#comment-13143990 ] 

gaojinchao commented on HBASE-4749:
-----------------------------------

This looks like a bug in TRUNK.
In 0.90, if we kill a RS and start a Master at the same time, the Master does not add the dying RS to the online set.
In 0.92, however, the Master does add the dying RS to the online set.
This produces a number of unusual scenarios:
1. If root/meta is on the dying RS, we may lose data because the HLog is not split. See https://issues.apache.org/jira/browse/HBASE-4511.
2. In testMasterFailoverWithMockedRITOnDeadRS, the mocked scenarios become invalid.

See these logs:

// We kill this RS (1320357166142)
2011-11-03 21:52:56,007 INFO  [Thread-986] master.TestMasterFailover(1011): 

Killing RS juno.apache.org,60001,1320357166142 

// The new Master picks up this RS (1320357166142) through its zk node.
2011-11-03 21:52:57,356 INFO  [Master:0;juno.apache.org,51313,1320357176029] master.HMaster(464): Registering server found up in zk: juno.apache.org,60001,1320357166142
2011-11-03 21:52:57,357 INFO  [Master:0;juno.apache.org,51313,1320357176029] master.ServerManager(239): Registering server=juno.apache.org,60001,1320357166142


So I think we should wait until the killed RS has shut down before starting a new HMaster.
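
The wait suggested above could be written as a bounded poll rather than a fixed sleep. The following is only a minimal sketch under stated assumptions: `WaitFor` and `waitFor` are hypothetical names, not HBase test utilities, and the `rsStopped` flag stands in for whatever check the test would really use to decide that the killed RS is gone.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/** Hypothetical test helper; not part of HBase. */
final class WaitFor {
  private WaitFor() {}

  /**
   * Polls the condition every intervalMs until it holds or timeoutMs elapses.
   * Returns true if the condition held, false on timeout or interruption.
   */
  static boolean waitFor(long timeoutMs, long intervalMs, BooleanSupplier condition) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        // Restore the interrupt flag and give up waiting.
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Stand-in for "the killed RS has fully stopped" (assumption for this sketch).
    AtomicBoolean rsStopped = new AtomicBoolean(false);
    Thread dyingRs = new Thread(() -> {
      try {
        Thread.sleep(200);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      rsStopped.set(true);
    });
    dyingRs.start();

    // Only start the new master once the condition holds, or fail the test on timeout.
    boolean stopped = waitFor(5000, 20, rsStopped::get);
    System.out.println("RS stopped before master start: " + stopped);
    dyingRs.join();
  }
}
```

Compared with a fixed sleep, the poll returns as soon as the condition holds and turns a slow shutdown into an explicit timeout instead of a silent race.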
                
> TestMasterFailover case occasional fails
> ----------------------------------------
>
>                 Key: HBASE-4749
>                 URL: https://issues.apache.org/jira/browse/HBASE-4749
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Priority: Minor
>             Fix For: 0.92.0
>
>
> look this logs:
> https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.92/105/testReport/org.apache.hadoop.hbase.master/TestMasterFailover/testMasterFailoverWithMockedRITOnDeadRS/


        

[jira] [Commented] (HBASE-4749) TestMasterFailover case occasional fails

Posted by "Ted Yu (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144085#comment-13144085 ] 

Ted Yu commented on HBASE-4749:
-------------------------------

A third option: delete the ephemeral node for the aborted region server before starting the new master.
                
> TestMasterFailover case occasional fails
> ----------------------------------------
>
>                 Key: HBASE-4749
>                 URL: https://issues.apache.org/jira/browse/HBASE-4749
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Priority: Minor
>             Fix For: 0.92.0
>
>
> look this logs:
> https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.92/105/testReport/org.apache.hadoop.hbase.master/TestMasterFailover/testMasterFailoverWithMockedRITOnDeadRS/


        

[jira] [Updated] (HBASE-4749) TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails

Posted by "Ted Yu (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-4749:
--------------------------

    Summary: TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails  (was: TestMasterFailover case occasional fails)
    
> TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-4749
>                 URL: https://issues.apache.org/jira/browse/HBASE-4749
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Priority: Minor
>             Fix For: 0.92.0
>
>
> look this logs:
> https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.92/105/testReport/org.apache.hadoop.hbase.master/TestMasterFailover/testMasterFailoverWithMockedRITOnDeadRS/


        

[jira] [Commented] (HBASE-4749) TestMasterFailover case occasional fails

Posted by "Ted Yu (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144036#comment-13144036 ] 

Ted Yu commented on HBASE-4749:
-------------------------------

Thanks for the finding, Jinchao.

From the log of build 105:
{code}
Killing RS juno.apache.org,60001,1320357166142 


2011-11-03 21:52:56,007 FATAL [Thread-986] regionserver.HRegionServer(1523): ABORTING region server juno.apache.org,60001,1320357166142: Killing for unit test
...
2011-11-03 21:52:56,011 WARN  [Thread-986] regionserver.HRegionServer(1545): Unable to report fatal error to master
java.lang.reflect.UndeclaredThrowableException
	at $Proxy16.reportRSFatalError(Unknown Source)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:1541)
...
2011-11-03 21:52:57,356 INFO [Master:0;juno.apache.org,51313,1320357176029] master.HMaster(464): Registering server found up in zk: juno.apache.org,60001,1320357166142
2011-11-03 21:52:57,357 INFO [Master:0;juno.apache.org,51313,1320357176029] master.ServerManager(239): Registering server=juno.apache.org,60001,1320357166142
...
2011-11-03 21:52:57,586 INFO  [Thread-986-EventThread] zookeeper.RegionServerTracker(93): RegionServer ephemeral node deleted, processing expiration [juno.apache.org,60001,1320357166142]
2011-11-03 21:52:57,588 INFO  [RegionServer:1;juno.apache.org,60001,1320357166142] regionserver.HRegionServer(744): stopping server juno.apache.org,60001,1320357166142; zookeeper connection closed.
{code}
We can see that there was a 570ms delay before the region server shutdown handler completed. That is why the dead region server was re-registered.

Since reportRSFatalError() hit an exception, we cannot rely on this callback to reach the master.

We have two options:
1. devise a mechanism to tell the new master the identity of the dead region server
2. insert a sleep of say 1 second before starting the new master

Option 1 introduces extra complexity into Master. I am not sure if it is worth it just for test purposes.
Many people wouldn't like option 2.

More discussion is welcome.
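
To make option 1 concrete, here is a rough sketch of what filtering could look like: servers found up in zk are registered only if they are not in a set of servers already known to be dead. This is an illustration under assumptions, not the Master code. `ZkServerFilter` and `serversToRegister` are hypothetical names, and the second server name in `main` is invented for the example; only the first one comes from the log above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of option 1; not HBase code. */
final class ZkServerFilter {
  private ZkServerFilter() {}

  /**
   * Returns the servers found in zk that are safe to register,
   * i.e. everything except the servers already known to be dead.
   * Iteration order follows the order of foundInZk.
   */
  static Set<String> serversToRegister(List<String> foundInZk, Set<String> knownDead) {
    Set<String> result = new LinkedHashSet<>(foundInZk);
    result.removeAll(knownDead);
    return result;
  }

  public static void main(String[] args) {
    // The first name is the killed RS from the log; the second is made up.
    List<String> foundInZk = Arrays.asList(
        "juno.apache.org,60001,1320357166142",
        "juno.apache.org,60002,1320357166200");
    Set<String> knownDead = Collections.singleton("juno.apache.org,60001,1320357166142");

    for (String server : serversToRegister(foundInZk, knownDead)) {
      System.out.println("Registering server found up in zk: " + server);
    }
  }
}
```

The hard part of option 1 is not the filter but getting `knownDead` to the new master at all, which is the extra complexity mentioned above.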
                
> TestMasterFailover case occasional fails
> ----------------------------------------
>
>                 Key: HBASE-4749
>                 URL: https://issues.apache.org/jira/browse/HBASE-4749
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Priority: Minor
>             Fix For: 0.92.0
>
>
> look this logs:
> https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.92/105/testReport/org.apache.hadoop.hbase.master/TestMasterFailover/testMasterFailoverWithMockedRITOnDeadRS/


        

[jira] [Updated] (HBASE-4749) TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails

Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-4749:
-------------------------

      Resolution: Fixed
        Assignee: stack
    Hadoop Flags: Reviewed
          Status: Resolved  (was: Patch Available)

Committed branch and trunk.
                
> TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-4749
>                 URL: https://issues.apache.org/jira/browse/HBASE-4749
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Assignee: stack
>            Priority: Critical
>             Fix For: 0.92.0
>
>         Attachments: 4749-v2.txt, 4749.txt
>
>
> look this logs:
> https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.92/105/testReport/org.apache.hadoop.hbase.master/TestMasterFailover/testMasterFailoverWithMockedRITOnDeadRS/


        

[jira] [Updated] (HBASE-4749) TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails

Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-4749:
-------------------------

    Status: Open  (was: Patch Available)
    
> TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-4749
>                 URL: https://issues.apache.org/jira/browse/HBASE-4749
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Priority: Critical
>             Fix For: 0.92.0
>
>         Attachments: 4749-v2.txt, 4749.txt
>
>
> look this logs:
> https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.92/105/testReport/org.apache.hadoop.hbase.master/TestMasterFailover/testMasterFailoverWithMockedRITOnDeadRS/


        

[jira] [Commented] (HBASE-4749) TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144560#comment-13144560 ] 

stack commented on HBASE-4749:
------------------------------

bq. I think we should increase ulimit on asf001.sp2.ygridcore.net

I added printing of the ulimit and it seems to have good numbers for files:

{code}
asf001.sp2.ygridcore.net
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 20
file size               (blocks, -f) unlimited
pending signals                 (-i) 16382
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 32768
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2048
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
32768
Running in Jenkins mode
{code}

There must be some other limit we need to raise.
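
One way to narrow this down from inside the test JVM, rather than from the shell, is to log descriptor and thread usage next to the limits. The sketch below is only a suggestion and makes assumptions: `ResourceUsage` is a hypothetical class name, and the descriptor counts rely on `com.sun.management.UnixOperatingSystemMXBean`, which is JDK-specific and only available on Unix-like platforms. It does not say which limit is actually being hit.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

/** Hypothetical diagnostic helper; not part of HBase. */
final class ResourceUsage {
  private ResourceUsage() {}

  /**
   * Returns {open, max} file descriptor counts for this JVM,
   * or {-1, -1} when the platform bean does not expose them.
   */
  static long[] fileDescriptorUsage() {
    OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
    if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
      com.sun.management.UnixOperatingSystemMXBean unix =
          (com.sun.management.UnixOperatingSystemMXBean) os;
      return new long[] {
          unix.getOpenFileDescriptorCount(), unix.getMaxFileDescriptorCount()};
    }
    return new long[] {-1L, -1L};
  }

  /** Returns the number of live threads in this JVM. */
  static int liveThreads() {
    return ManagementFactory.getThreadMXBean().getThreadCount();
  }

  public static void main(String[] args) {
    long[] fds = fileDescriptorUsage();
    System.out.println("open file descriptors: " + fds[0] + " of max " + fds[1]);
    System.out.println("live threads: " + liveThreads());
  }
}
```

Logging these numbers when a test fails would show whether the JVM is anywhere near the open files or max user processes values printed above.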
                
> TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-4749
>                 URL: https://issues.apache.org/jira/browse/HBASE-4749
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Priority: Critical
>             Fix For: 0.92.0
>
>         Attachments: 4749-v2.txt, 4749.txt
>
>
> look this logs:
> https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.92/105/testReport/org.apache.hadoop.hbase.master/TestMasterFailover/testMasterFailoverWithMockedRITOnDeadRS/
