Posted to common-dev@hadoop.apache.org by "stack (JIRA)" <ji...@apache.org> on 2007/10/09 21:52:51 UTC

[jira] Created: (HADOOP-2017) [hbase] TestRegionServerAbort failure in patch build #903 and nightly #266

[hbase] TestRegionServerAbort failure in patch build #903 and nightly #266
--------------------------------------------------------------------------

                 Key: HADOOP-2017
                 URL: https://issues.apache.org/jira/browse/HADOOP-2017
             Project: Hadoop
          Issue Type: Bug
          Components: contrib/hbase
            Reporter: stack
            Priority: Minor


In patch build #903, the metascanner keeps trying to go to the downed server even though onlineMetaRegions has been updated with the new location, and then the metascanner just goes away (or hangs).

In nightly build #266, it's a similar scenario, only the remaining region servers decide to shut down because they haven't been able to reach the master in 7 seconds.
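
To make the second failure mode concrete, here is a minimal Java sketch of a contact-timeout check of the kind described. Only the 7-second figure comes from the report; the class name, method names, and the choice to pass the clock in explicitly are assumptions for illustration, not HBase's actual code.

```java
/** Hypothetical model of "shut down if the master has been unreachable too long". */
final class MasterContactTimeoutSketch {
    private final long timeoutMs;   // e.g. 7000 for the 7 seconds in the report
    private long lastContactMs;     // time of the last successful call to the master

    MasterContactTimeoutSketch(long timeoutMs, long nowMs) {
        this.timeoutMs = timeoutMs;
        this.lastContactMs = nowMs;
    }

    /** Call after every successful report to the master. */
    void recordSuccessfulContact(long nowMs) {
        lastContactMs = nowMs;
    }

    /** True once the master has been unreachable for at least timeoutMs. */
    boolean shouldShutDown(long nowMs) {
        return nowMs - lastContactMs >= timeoutMs;
    }

    public static void main(String[] args) {
        MasterContactTimeoutSketch check = new MasterContactTimeoutSketch(7000, 0);
        // Just inside the 7-second window.
        System.out.println(check.shouldShutDown(6999));
        // A successful contact restarts the window.
        check.recordSuccessfulContact(6999);
        // Exactly 7000 ms after the last contact.
        System.out.println(check.shouldShutDown(13999));
    }
}
```

Under this model, a master that is slow to answer while one region server is being aborted could trip the check on the surviving servers. Whether that is what actually happened in nightly #266 is not established in this thread.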

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Assigned: (HADOOP-2017) [hbase] TestRegionServerAbort failure in patch build #903 and nightly #266

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack reassigned HADOOP-2017:
-----------------------------

    Assignee: stack



[jira] Updated: (HADOOP-2017) [hbase] TestRegionServerAbort failure in patch build #903 and nightly #266

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HADOOP-2017:
--------------------------

    Attachment: trsa.patch

A patch with more logging and thread dumping to help show what is going on, and a mechanism that notices moved regions sooner.
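
As a rough illustration of the thread-dumping idea, the following Java sketch joins on a thread and prints a dump of all live threads each time a wait interval passes without the thread finishing. The patch notes mention a threadDumpingJoin helper, but its real body is not shown in this thread, so the signature, the return value, and the dump format here are all assumptions.

```java
import java.util.Map;

/** Hypothetical helper; only the method name is borrowed from the patch notes. */
final class ThreadDumpingJoinSketch {

    /** Builds a plain-text dump of every live thread's name, state, and stack. */
    static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append('"').append(t.getName()).append("\" state=").append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    /**
     * Waits for t to finish. Each time dumpIntervalMs passes with t still alive,
     * prints a thread dump to stderr. Returns the number of dumps printed.
     * If the caller is interrupted, restores the interrupt flag and returns early.
     */
    static int threadDumpingJoin(Thread t, long dumpIntervalMs) {
        if (dumpIntervalMs <= 0) {
            // Thread.join(0) would wait forever and never dump.
            throw new IllegalArgumentException("dumpIntervalMs must be positive");
        }
        int dumps = 0;
        try {
            while (t.isAlive()) {
                t.join(dumpIntervalMs);
                if (t.isAlive()) {
                    System.err.println("Still waiting on " + t.getName() + "; thread dump follows");
                    System.err.print(dumpAllThreads());
                    dumps++;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return dumps;
    }

    public static void main(String[] args) {
        Thread slow = new Thread(() -> {
            try {
                Thread.sleep(250);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "slow-verifier");
        slow.start();
        System.out.println("dumps taken: " + threadDumpingJoin(slow, 100));
    }
}
```

A test could run its verification in a separate thread and call this with a one-minute interval, so that a hung build leaves stack traces in the log instead of just timing out.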

{code}
HADOOP-2017 TestRegionServerAbort failure in patch build #903 and nightly #266

Notice moved META regions sooner.  Also added more logging and
thread dumping once a minute when the test starts to take too long
so we can see where we are hung (if we are hung).

M  src/contrib/hbase/src/test/org/apache/hadoop/hbase/TestHStoreFile.java
    Inherit from HBaseTestCase.
M  src/contrib/hbase/src/test/org/apache/hadoop/hbase/HBaseClusterTestCase.java
    (threadDumpingJoin): Added.
M  src/contrib/hbase/src/test/org/apache/hadoop/hbase/TestRegionServerAbort.java
    Run verification in its own thread so we can concurrently thread dump if
    the test is going on too long.
M  src/contrib/hbase/src/test/org/apache/hadoop/hbase/DFSAbort.java
    Moved join up into parent class.
M  src/contrib/hbase/src/java/org/apache/hadoop/hbase/Chore.java
    Remove unused import.
M  src/contrib/hbase/src/java/org/apache/hadoop/hbase/HMaster.java
    (MetaRegion.toString): Added.
    Added logging around assignment checking and log split.
    (MetaRegion.compareTo): Add consideration of server address.
    (numberOfMetaRegions, metaRegionsToScan, onlineMetaRegions):
      Put declaration and assignment together and made final.
    (scanOneMetaRegion): If the region is no longer in onlineMetaRegions,
    give up trying to scan.
    (unassignRootRegion): Added (Not yet finished).
{code}
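
The two HMaster changes that bear on the stale-location symptom (compareTo taking the server address into account, and scanOneMetaRegion giving up once the region is no longer in onlineMetaRegions) can be sketched as follows. This is an illustration under assumptions, not the patch itself: the field names, the use of String for keys and addresses, the retry loop, and the Result enum are all hypothetical.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.function.Predicate;

/** Hypothetical stand-in for the master's record of a META region and its host. */
final class MetaRegionSketch implements Comparable<MetaRegionSketch> {
    final String regionName;
    final String startKey;
    final String serverAddress; // "host:port" of the server holding the region

    MetaRegionSketch(String regionName, String startKey, String serverAddress) {
        this.regionName = regionName;
        this.startKey = startKey;
        this.serverAddress = serverAddress;
    }

    /** Orders by name, start key, then server address, so a moved region compares as different. */
    @Override
    public int compareTo(MetaRegionSketch other) {
        int c = regionName.compareTo(other.regionName);
        if (c == 0) c = startKey.compareTo(other.startKey);
        if (c == 0) c = serverAddress.compareTo(other.serverAddress);
        return c;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof MetaRegionSketch && compareTo((MetaRegionSketch) o) == 0;
    }

    @Override
    public int hashCode() {
        return Objects.hash(regionName, startKey, serverAddress);
    }

    @Override
    public String toString() {
        return "regionName: " + regionName + ", startKey: <" + startKey + ">, server: " + serverAddress;
    }
}

final class MetaScanSketch {
    enum Result { SCANNED, REGION_MOVED, RETRIES_EXHAUSTED }

    /** Start key -> where the master currently believes that META region is served. */
    final Map<String, MetaRegionSketch> onlineMetaRegions = new ConcurrentSkipListMap<>();

    /** True only if the map still holds this exact region at this exact server. */
    boolean isStillOnline(MetaRegionSketch region) {
        return region.equals(onlineMetaRegions.get(region.startKey));
    }

    /** Retries the scan, but stops as soon as the region has moved or gone offline. */
    Result scanOneMetaRegion(MetaRegionSketch region,
                             Predicate<MetaRegionSketch> scanAttempt,
                             int maxTries) {
        for (int tries = 0; tries < maxTries; tries++) {
            if (!isStillOnline(region)) {
                return Result.REGION_MOVED; // stale location: do not contact the old server
            }
            if (scanAttempt.test(region)) {
                return Result.SCANNED;
            }
        }
        return Result.RETRIES_EXHAUSTED;
    }

    public static void main(String[] args) {
        MetaScanSketch master = new MetaScanSketch();
        MetaRegionSketch stale = new MetaRegionSketch(".META.,,1", "", "host-a:60020");
        MetaRegionSketch moved = new MetaRegionSketch(".META.,,1", "", "host-b:60020");
        master.onlineMetaRegions.put(moved.startKey, moved);
        System.out.println(master.scanOneMetaRegion(stale, r -> false, 3));
        System.out.println(master.scanOneMetaRegion(moved, r -> true, 3));
    }
}
```

Including the server address in the comparison is what lets the scanner tell "same region, new host" apart from "same region, same host"; without it, the stale record would still look current.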



[jira] Updated: (HADOOP-2017) [hbase] TestRegionServerAbort failure in patch build #903 and nightly #266

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HADOOP-2017:
--------------------------

    Fix Version/s: 0.15.0
           Status: Patch Available  (was: Open)

Builds locally.



[jira] Updated: (HADOOP-2017) [hbase] TestRegionServerAbort failure in patch build #903 and nightly #266

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HADOOP-2017:
--------------------------

       Resolution: Fixed
    Fix Version/s: 0.15.0
           Status: Resolved  (was: Patch Available)

Hasn't recurred since commit.  HADOOP-2038 should also make this issue less likely.  Also, this test has been moved into TestRegionServerExit.  Resolving as fixed.



[jira] Commented: (HADOOP-2017) [hbase] TestRegionServerAbort failure in patch build #903 and nightly #266

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533478 ] 

stack commented on HADOOP-2017:
-------------------------------

Nightly #263 also failed on TRSA (TestRegionServerAbort) in the same manner as patch build #903.



[jira] Commented: (HADOOP-2017) [hbase] TestRegionServerAbort failure in patch build #903 and nightly #266

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533722 ] 

Hudson commented on HADOOP-2017:
--------------------------------

Integrated in Hadoop-Nightly #267 (See [http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/267/])



[jira] Commented: (HADOOP-2017) [hbase] TestRegionServerAbort failure in patch build #903 and nightly #266

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533531 ] 

stack commented on HADOOP-2017:
-------------------------------

Applied patch.  Now waiting to see if the problem occurs again.  If so, the extra logging and thread dumps should help.



[jira] Updated: (HADOOP-2017) [hbase] TestRegionServerAbort failure in patch build #903 and nightly #266

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HADOOP-2017:
--------------------------

    Fix Version/s:     (was: 0.15.0)



[jira] Commented: (HADOOP-2017) [hbase] TestRegionServerAbort failure in patch build #903 and nightly #266

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533514 ] 

Hadoop QA commented on HADOOP-2017:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
http://issues.apache.org/jira/secure/attachment/12367392/trsa.patch
against trunk revision r583037.

    @author +1.  The patch does not contain any @author tags.

    javadoc +1.  The javadoc tool did not generate any warning messages.

    javac +1.  The applied patch does not generate any new compiler warnings.

    findbugs +1.  The patch does not introduce any new Findbugs warnings.

    core tests +1.  The patch passed core unit tests.

    contrib tests +1.  The patch passed contrib unit tests.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/910/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/910/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/910/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/910/console

This message is automatically generated.

