Posted to issues@hbase.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2010/11/23 09:01:16 UTC
[jira] Created: (HBASE-3263) Stack overflow in AssignmentManager
Stack overflow in AssignmentManager
-----------------------------------
Key: HBASE-3263
URL: https://issues.apache.org/jira/browse/HBASE-3263
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.0
Reporter: Todd Lipcon
Priority: Blocker
Attachments: stackoverflow-log.txt
My test cluster experienced a switch outage earlier this week which threw the master into a really bad state. In the catch clause of AssignmentManager.assign, we recurse, and if all of the region servers are inaccessible, we do so until we get a stack overflow.
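To illustrate the failure mode described above, here is a minimal sketch, not the actual AssignmentManager code. The class name, the `sendRegionOpen` stand-in, the server names, and the `depth` bookkeeping are all assumptions made for illustration. It retries by recursing from the catch clause, so every failed attempt adds one stack frame; with a finite number of failures the depth is merely large, but if no server ever becomes reachable the recursion would continue until a StackOverflowError.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of "retry by recursing in the catch clause".
class RecursiveAssignSketch {
    // Deepest recursion level reached, recorded only for illustration.
    static int maxDepth = 0;

    // Stand-in for the RPC that asks a region server to open a region.
    // Servers whose name starts with "down" are treated as unreachable.
    static void sendRegionOpen(String server) throws Exception {
        if (server.startsWith("down")) {
            throw new Exception("unreachable: " + server);
        }
    }

    // Tries one server; on failure, recurses to try the next one.
    // Each failure therefore deepens the call stack by one frame.
    static String assign(List<String> servers, int attempt, int depth) {
        maxDepth = Math.max(maxDepth, depth);
        String server = servers.get(attempt % servers.size());
        try {
            sendRegionOpen(server);
            return server;
        } catch (Exception e) {
            return assign(servers, attempt + 1, depth + 1);
        }
    }

    public static void main(String[] args) {
        List<String> servers = Arrays.asList("down-1", "down-2", "down-3", "up-1");
        String chosen = assign(servers, 0, 1);
        System.out.println(chosen + " reached at recursion depth " + maxDepth);
    }
}
```

With three unreachable servers ahead of a reachable one, the successful call sits four frames deep. Replace `up-1` with another unreachable server and the depth has no upper bound.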
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-3263) Stack overflow in AssignmentManager
Posted by "stack (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965030#action_12965030 ]
stack commented on HBASE-3263:
------------------------------
Let me do that. I hated that recursion thingy.
[jira] Resolved: (HBASE-3263) Stack overflow in AssignmentManager
Posted by "stack (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
stack resolved HBASE-3263.
--------------------------
Resolution: Fixed
Assignee: stack
Hadoop Flags: [Reviewed]
I committed to trunk and branch (removed the AtomicInteger; that was a little silly). Thanks for the review, Jon.
[jira] Commented: (HBASE-3263) Stack overflow in AssignmentManager
Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965004#action_12965004 ]
Jonathan Gray commented on HBASE-3263:
--------------------------------------
+1 for commit. Seems like we could do without an AtomicInteger, but that's minor.
[jira] Commented: (HBASE-3263) Stack overflow in AssignmentManager
Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965035#action_12965035 ]
Jonathan Gray commented on HBASE-3263:
--------------------------------------
Wait! Are you missing the return in the normal flow? It seems that after a successful sendRegionOpen we will still loop.
[jira] Commented: (HBASE-3263) Stack overflow in AssignmentManager
Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934770#action_12934770 ]
Todd Lipcon commented on HBASE-3263:
------------------------------------
And also thereafter lots of these:
java.lang.NullPointerException
    at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.sendParam(HBaseClient.java:485)
    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:733)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
    at $Proxy8.getRegionInfo(Unknown Source)
    at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyRegionLocation(CatalogTracker.java:416)
    at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:270)
    at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:322)
So it seems we somehow borked a null into one of our maps.
[jira] Updated: (HBASE-3263) Stack overflow in AssignmentManager
Posted by "stack (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
stack updated HBASE-3263:
-------------------------
Fix Version/s: 0.90.0
[jira] Commented: (HBASE-3263) Stack overflow in AssignmentManager
Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965044#action_12965044 ]
Jonathan Gray commented on HBASE-3263:
--------------------------------------
Seems right.
[jira] Updated: (HBASE-3263) Stack overflow in AssignmentManager
Posted by "stack (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
stack updated HBASE-3263:
-------------------------
Attachment: 3263-v3.txt
Changed recursion to loop. Here is what I applied to trunk and 0.90 branch.
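As a rough illustration of what turning the recursion into a loop can look like, here is a minimal sketch. It is not the contents of 3263-v3.txt: the class name, the `sendRegionOpen` stand-in, the server names, and the `maxAttempts` bound are all assumptions. The point is that a failed attempt moves to the next iteration in the same stack frame, and that the success path has to leave the loop explicitly.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of an iterative retry; the stack stays flat.
class LoopAssignSketch {
    // Stand-in for the RPC that asks a region server to open a region.
    // Servers whose name starts with "down" are treated as unreachable.
    static void sendRegionOpen(String server) throws Exception {
        if (server.startsWith("down")) {
            throw new Exception("unreachable: " + server);
        }
    }

    // Returns the server that accepted the region, or null when
    // maxAttempts (an assumed bound) is exhausted without success.
    static String assign(List<String> servers, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            String server = servers.get(attempt % servers.size());
            try {
                sendRegionOpen(server);
                // Success: leave the loop here, otherwise we would
                // keep sending open requests for the same region.
                return server;
            } catch (Exception e) {
                // Failure: fall through and try the next server.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(assign(Arrays.asList("down-1", "up-1"), 10));
        System.out.println(assign(Arrays.asList("down-1", "down-2"), 100000));
    }
}
```

Even with every server unreachable and a large attempt bound, the loop finishes without growing the stack.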
[jira] Commented: (HBASE-3263) Stack overflow in AssignmentManager
Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965015#action_12965015 ]
Todd Lipcon commented on HBASE-3263:
------------------------------------
Dare I ask why not just make it into a loop? :)
[jira] Commented: (HBASE-3263) Stack overflow in AssignmentManager
Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934769#action_12934769 ]
Todd Lipcon commented on HBASE-3263:
------------------------------------
Shortly after the StackOverflowError it also started spitting this exception:
2010-11-19 12:09:50,366 WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of usertable,,1289960558114.03110b4c3c0b24fa1c920ec7669d03a6. to serverName=haus03.sf.cloudera.com,60020,1289890926773, load=(requests=0, regions=11, usedHeap=5403, maxHeap=8185), trying to assign elsewhere instead
java.lang.NullPointerException
    at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.sendParam(HBaseClient.java:485)
    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:733)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
    at $Proxy8.openRegion(Unknown Source)
    at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:537)
    at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:830)
[jira] Commented: (HBASE-3263) Stack overflow in AssignmentManager
Posted by "stack (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965041#action_12965041 ]
stack commented on HBASE-3263:
------------------------------
Duh. Thanks, Jon. Check what I just committed.
Added a break.
[jira] Updated: (HBASE-3263) Stack overflow in AssignmentManager
Posted by "stack (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
stack updated HBASE-3263:
-------------------------
Attachment: 3263.txt
Patch to bound the number of recursive reassign attempts. Not pretty, but it should prevent this runaway from happening.
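For readers without the attachment, bounding a recursive retry can be sketched roughly as follows. This is not the contents of 3263.txt: the class name, the `sendRegionOpen` stand-in, the server names, and the `MAX_ATTEMPTS` value are assumptions, and the sketch threads a plain `int` counter through the calls instead of the AtomicInteger mentioned elsewhere in this thread.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a recursion with an attempt limit.
class BoundedRecursionSketch {
    // Assumed limit; the real patch may use a different bound.
    static final int MAX_ATTEMPTS = 10;

    // Stand-in for the RPC that asks a region server to open a region.
    // Servers whose name starts with "down" are treated as unreachable.
    static void sendRegionOpen(String server) throws Exception {
        if (server.startsWith("down")) {
            throw new Exception("unreachable: " + server);
        }
    }

    // Still recursive, but gives up (returns null) after MAX_ATTEMPTS,
    // so the stack depth can never exceed MAX_ATTEMPTS + 1 frames.
    static String assign(List<String> servers, int attempt) {
        if (attempt >= MAX_ATTEMPTS) {
            return null;
        }
        String server = servers.get(attempt % servers.size());
        try {
            sendRegionOpen(server);
            return server;
        } catch (Exception e) {
            return assign(servers, attempt + 1);
        }
    }

    public static void main(String[] args) {
        System.out.println(assign(Arrays.asList("down-1", "up-1"), 0));
        System.out.println(assign(Arrays.asList("down-1", "down-2"), 0));
    }
}
```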
[jira] Updated: (HBASE-3263) Stack overflow in AssignmentManager
Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Todd Lipcon updated HBASE-3263:
-------------------------------
Attachment: stackoverflow-log.txt
Here's a log showing the beginning of the runaway recursion. It goes like this until it gets a stack overflow error.