Posted to issues@hbase.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2010/11/23 09:01:16 UTC

[jira] Created: (HBASE-3263) Stack overflow in AssignmentManager

Stack overflow in AssignmentManager
-----------------------------------

                 Key: HBASE-3263
                 URL: https://issues.apache.org/jira/browse/HBASE-3263
             Project: HBase
          Issue Type: Bug
          Components: master
    Affects Versions: 0.90.0
            Reporter: Todd Lipcon
            Priority: Blocker
         Attachments: stackoverflow-log.txt

My test cluster experienced a switch outage earlier this week which threw the master into a really bad state. In the catch clause of AssignmentManager.assign, we recurse, and if all of the region servers are inaccessible, we do so until we get a stack overflow.
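
For illustration only, here is a minimal sketch of the pattern described above. It is not the HBase source: the class RecursiveAssignSketch, the server names, and this sendRegionOpen stand-in are all hypothetical. It only shows why retrying from inside a catch clause by recursion consumes one stack frame per failed attempt.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the retry-by-recursion pattern; not the HBase code.
class RecursiveAssignSketch {

    /** Hypothetical RPC stand-in: every region server is unreachable. */
    static void sendRegionOpen(String server) throws IOException {
        throw new IOException("cannot reach " + server);
    }

    /** Retries from the catch clause by recursing, so each failure adds a frame. */
    static void assign(List<String> servers, int attempt) {
        String server = servers.get(attempt % servers.size());
        try {
            sendRegionOpen(server);
        } catch (IOException e) {
            // "Try to assign elsewhere" by calling ourselves again.
            // With every server down, this call never returns normally.
            assign(servers, attempt + 1);
        }
    }

    /** Returns true if the retries ran the thread out of stack. */
    static boolean overflows(List<String> servers) {
        try {
            assign(servers, 0);
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("overflowed: " + overflows(Arrays.asList("rs1", "rs2")));
    }
}
```

Java does not eliminate tail calls, so nothing in this shape can bound the stack depth once every server keeps failing.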

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3263) Stack overflow in AssignmentManager

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965030#action_12965030 ] 

stack commented on HBASE-3263:
------------------------------

Let me do that.  I hated that recursion thingy.


[jira] Resolved: (HBASE-3263) Stack overflow in AssignmentManager

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-3263.
--------------------------

      Resolution: Fixed
        Assignee: stack
    Hadoop Flags: [Reviewed]

I committed to trunk and branch (removed the AtomicInteger; that was a little silly).  Thanks for the review, Jon.


[jira] Commented: (HBASE-3263) Stack overflow in AssignmentManager

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965004#action_12965004 ] 

Jonathan Gray commented on HBASE-3263:
--------------------------------------

+1 for commit.  Seems like we could do without an AtomicInteger, but that's minor.


[jira] Commented: (HBASE-3263) Stack overflow in AssignmentManager

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965035#action_12965035 ] 

Jonathan Gray commented on HBASE-3263:
--------------------------------------

Wait!  Are you missing the return in the normal flow?  It seems that after a successful sendRegionOpen we will still loop.


[jira] Commented: (HBASE-3263) Stack overflow in AssignmentManager

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934770#action_12934770 ] 

Todd Lipcon commented on HBASE-3263:
------------------------------------

And also thereafter lots of these:

java.lang.NullPointerException
  at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.sendParam(HBaseClient.java:485)
  at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:733)
  at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
  at $Proxy8.getRegionInfo(Unknown Source)
  at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyRegionLocation(CatalogTracker.java:416)
  at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:270)
  at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:322)

So somehow we borked a null into one of our maps, it seems


[jira] Updated: (HBASE-3263) Stack overflow in AssignmentManager

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-3263:
-------------------------

    Fix Version/s: 0.90.0


[jira] Commented: (HBASE-3263) Stack overflow in AssignmentManager

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965044#action_12965044 ] 

Jonathan Gray commented on HBASE-3263:
--------------------------------------

Seems right.


[jira] Updated: (HBASE-3263) Stack overflow in AssignmentManager

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-3263:
-------------------------

    Attachment: 3263-v3.txt

Changed recursion to loop.  Here is what I applied to trunk and 0.90 branch.


[jira] Commented: (HBASE-3263) Stack overflow in AssignmentManager

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965015#action_12965015 ] 

Todd Lipcon commented on HBASE-3263:
------------------------------------

Dare I ask why not just make it into a loop? :)


[jira] Commented: (HBASE-3263) Stack overflow in AssignmentManager

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934769#action_12934769 ] 

Todd Lipcon commented on HBASE-3263:
------------------------------------

Shortly after the StackOverflowError it also started spitting this exception:

2010-11-19 12:09:50,366 WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of usertable,,1289960558114.03110b4c3c0b24fa1c920ec7669d03a6. to serverName=haus03.sf.cloudera.com,60020,1289890926773, load=(requests=0, regions=11, usedHeap=5403, maxHeap=8185), trying to assign elsewhere instead
java.lang.NullPointerException
  at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.sendParam(HBaseClient.java:485)
  at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:733)
  at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
  at $Proxy8.openRegion(Unknown Source)
  at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:537)
  at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:830)



[jira] Commented: (HBASE-3263) Stack overflow in AssignmentManager

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965041#action_12965041 ] 

stack commented on HBASE-3263:
------------------------------

Duh.. thanks Jon.  Check what I just committed.
Added a break.
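
To make the point about the missing exit concrete, here is a rough sketch of a loop-based retry with the break in place. This is not the committed patch: LoopAssignSketch, the RegionOpener interface, and the maxAttempts bound are assumptions made only so the example is self-contained and terminates.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical sketch of retrying in a loop instead of recursing; not the HBase code.
class LoopAssignSketch {

    /** Hypothetical stand-in for the RPC that asks a region server to open a region. */
    interface RegionOpener {
        void sendRegionOpen(String server) throws IOException;
    }

    /**
     * Tries servers in turn without growing the stack. Returns the number of
     * RPC attempts made, or -1 if maxAttempts (an assumption of this sketch)
     * ran out before any server accepted the region.
     */
    static int assign(List<String> servers, RegionOpener opener, int maxAttempts) {
        int attempts = 0;
        boolean opened = false;
        while (attempts < maxAttempts) {
            String server = servers.get(attempts % servers.size());
            attempts++;
            try {
                opener.sendRegionOpen(server);
                opened = true;
                // Without this break, a successful open would be followed
                // by another attempt on the next server.
                break;
            } catch (IOException e) {
                // Failed assignment; go around the loop and try elsewhere.
            }
        }
        return opened ? attempts : -1;
    }
}
```

With three servers where only the third accepts the region, assign makes three attempts and stops. Without the break it would keep sending open requests until the attempt limit was reached.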


[jira] Updated: (HBASE-3263) Stack overflow in AssignmentManager

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-3263:
-------------------------

    Attachment: 3263.txt

Patch to bound the attempts at reassign recursions. Not pretty but should prevent this runaway from happening.
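
As a rough idea of what bounding the recursion can look like (a sketch under assumptions, not the contents of 3263.txt): the names BoundedRecursionSketch, RegionOpener, and maxAttempts are hypothetical, and a plain int parameter carries the attempt count.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical sketch of a recursion with an attempt limit; not the attached patch.
class BoundedRecursionSketch {

    /** Hypothetical stand-in for the RPC that asks a region server to open a region. */
    interface RegionOpener {
        void sendRegionOpen(String server) throws IOException;
    }

    /**
     * Still recurses from the catch clause, but gives up once maxAttempts
     * attempts have been made, so the stack depth is bounded by maxAttempts.
     * Returns true if some server accepted the region.
     */
    static boolean assign(List<String> servers, RegionOpener opener,
                          int attempt, int maxAttempts) {
        if (attempt >= maxAttempts) {
            return false; // stop the runaway instead of recursing again
        }
        String server = servers.get(attempt % servers.size());
        try {
            opener.sendRegionOpen(server);
            return true;
        } catch (IOException e) {
            return assign(servers, opener, attempt + 1, maxAttempts);
        }
    }
}
```

This keeps the recursive shape and only caps its depth.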


[jira] Updated: (HBASE-3263) Stack overflow in AssignmentManager

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HBASE-3263:
-------------------------------

    Attachment: stackoverflow-log.txt

Here's a log showing the beginning of the runaway recursion. It goes like this until it gets a stack overflow error.
