You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@hbase.apache.org by "Zhihong Yu (Created) (JIRA)" <ji...@apache.org> on 2012/01/04 03:46:41 UTC

[jira] [Created] (HBASE-5120) Timeout monitor races with table disable handler

Timeout monitor races with table disable handler
------------------------------------------------

                 Key: HBASE-5120
                 URL: https://issues.apache.org/jira/browse/HBASE-5120
             Project: HBase
          Issue Type: Bug
    Affects Versions: 0.92.0
            Reporter: Zhihong Yu
            Priority: Blocker


Here is what J-D described here:
https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176

I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.

First the region gets closed
{quote}
2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
{quote}

2 minutes later it times out:

{quote}
2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
{quote}

100ms later the master finally gets the event:

{quote}
2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
{quote}

At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?

{quote}
2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
{quote}

Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.

The late ZK notification that the znode was deleted (but it got recreated after):

{quote}
2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
{quote}

Then it prints this, and much later tries to unassign it again:

{quote}
2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
...
2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
{quote}

And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184205#comment-13184205 ] 

Zhihong Yu commented on HBASE-5120:
-----------------------------------

Can you change LOG.debug() to LOG.error() in deleteClosingOrClosedNode() ?
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182673#comment-13182673 ] 

Zhihong Yu commented on HBASE-5120:
-----------------------------------

I see different exception handling w.r.t. KeeperException in the changes.
{code}
+      } catch (KeeperException e) {
+        LOG.debug("CLOSING node for the region " + region.getEncodedName()
+            + " already deleted");
{code}
Can you explain why ?

For the new method, deleteClosingOrClosedNode(), I think it should utilize return value from ZKAssign.deleteNode()
Meaning, deleteClosingOrClosedNode() should return a boolean.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195163#comment-13195163 ] 

stack commented on HBASE-5120:
------------------------------

So, setting down the TM period from 30 minutes to 2 minutes and doing a bunch of online enable/disable works?  If so, I'm +1 on this patch (We'd actually set the timeout down from 30minutes to 5minutes over in hbase-5119?)
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch, HBASE-5120_5.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "ramkrishna.s.vasudevan (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-5120:
------------------------------------------

    Attachment: HBASE-5120_3.patch
    
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Issue Comment Edited] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "Lars Hofhansl (Issue Comment Edited) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179660#comment-13179660 ] 

Lars Hofhansl edited comment on HBASE-5120 at 1/4/12 5:17 PM:
--------------------------------------------------------------

Hey Ram, I know you said this is not done, yet...
{code}
if (t instanceof NullPointerException) {
  removeRegionInTransition(region);
...
{code}

Could we instead add a null check at the relevant point and deal with it there (or maybe throw another exception)? (Dealing with this after the NPE occurred leaves a bad taste... What if we introduce another bug later that also causes an NPE? That would go unnoticed for a while.)

                
      was (Author: lhofhansl):
    Hey Ram, I know you said this is not done, yet...
{code}
if (t instanceof NullPointerException) {
  removeRegionInTransition(region);
...
{code}

Could we instead add a null check at the relevant point and deal with it there (or maybe throw another exception)? (Dealing with this after the NPE occurred leaves a bad taste... What we introduce another bug later that also causes an NPE, that will go unnoticed for a while.)

                  
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>         Attachments: HBASE-5120.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210492#comment-13210492 ] 

stack commented on HBASE-5120:
------------------------------

I did not commit to 0.90 either.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.94.0
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch, HBASE-5120_5.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179243#comment-13179243 ] 

Zhihong Yu commented on HBASE-5120:
-----------------------------------

J-D ended up issuing 'kill -9' to the master.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185519#comment-13185519 ] 

Hadoop QA commented on HBASE-5120:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12510473/HBASE-5120_5.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    -1 javadoc.  The javadoc tool appears to have generated -147 warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 81 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

     -1 core tests.  The patch failed these unit tests:
                       org.apache.hadoop.hbase.replication.TestReplication

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/753//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/753//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/753//console

This message is automatically generated.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch, HBASE-5120_5.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180560#comment-13180560 ] 

stack commented on HBASE-5120:
------------------------------

@Ram Is odd that server is null; this should be 'impossible', right?
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: HBASE-5120.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "ramkrishna.s.vasudevan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180622#comment-13180622 ] 

ramkrishna.s.vasudevan commented on HBASE-5120:
-----------------------------------------------

@Stack
Yes Stack it is impossible.  But still we need to remove the region from RIT and also delete the node CLOSING node if one was created.
By tomorrow i can submit the patch after testing.
If you have some idea you can suggest. Also correct me if am wrong.


                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: HBASE-5120.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "ramkrishna.s.vasudevan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183206#comment-13183206 ] 

ramkrishna.s.vasudevan commented on HBASE-5120:
-----------------------------------------------

@Ted
I feel abort is not needed. I cannot see a place where the znode can be in another state.  Also if we do that if the znode is changed in the assign flow we still go with the deletion and will abort the master if deletion fails.

Just my thought. 
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "ramkrishna.s.vasudevan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183434#comment-13183434 ] 

ramkrishna.s.vasudevan commented on HBASE-5120:
-----------------------------------------------

@Ted
Ok.  I will make a patch.  Before making the next patch will wait for others comments.
@Stack
Any Comments ?
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "ramkrishna.s.vasudevan (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-5120:
------------------------------------------

    Attachment: HBASE-5120_1.patch

Patch addresses the issue.  But does not make the code cleaner as Stack wanted it to be. 
@Stack 
We may need to make the TM and unassign flow lighter. This patch addresses the problem in this JIRA.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "ramkrishna.s.vasudevan (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-5120:
------------------------------------------

    Status: Patch Available  (was: Open)
    
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "Lars Hofhansl (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179321#comment-13179321 ] 

Lars Hofhansl commented on HBASE-5120:
--------------------------------------

I see, thanks Ted.
Ram, you're are becoming the assignment/split/race-condition expert... :)

                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "ramkrishna.s.vasudevan (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-5120:
------------------------------------------

    Status: Open  (was: Patch Available)
    
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "ramkrishna.s.vasudevan (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-5120:
------------------------------------------

    Attachment: HBASE-5120_5.patch
    
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch, HBASE-5120_5.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182609#comment-13182609 ] 

Hadoop QA commented on HBASE-5120:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12509921/HBASE-5120_1.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    -1 javadoc.  The javadoc tool appears to have generated -151 warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 80 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

     -1 core tests.  The patch failed these unit tests:
                       org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat
                  org.apache.hadoop.hbase.mapred.TestTableMapReduce
                  org.apache.hadoop.hbase.mapreduce.TestImportTsv

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/708//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/708//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/708//console

This message is automatically generated.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "ramkrishna.s.vasudevan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179725#comment-13179725 ] 

ramkrishna.s.vasudevan commented on HBASE-5120:
-----------------------------------------------

@Lars and @Ted
Thanks for your reviews.
I was also not convinced in adding the instanceof check for NPE.  It does not look good.
I was trying at other option of not calling sendRegionClose if the server is null.
Also remove the RIT if anything has been added.  But need to verify some more scenarios. Not sure if i can dedicate time tomorrow as we have some internal workshop.  I will work on this and give my patch sooner. :)

                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: HBASE-5120.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Issue Comment Edited] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "Zhihong Yu (Issue Comment Edited) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183380#comment-13183380 ] 

Zhihong Yu edited comment on HBASE-5120 at 1/10/12 5:07 PM:
------------------------------------------------------------

{code}
+        LOG.debug("The deletion of the CLOSED node returned " + deleteNode);
{code}
I think the above should be at ERROR level. Also, please log the region name.
                
      was (Author: zhihyu@ebaysf.com):
    {code}
+        LOG.debug("The deletion of the CLOSED node returned " + deleteNode);
{code}
I think the above should be at ERROR level. Also, please make the sentence syntactically correct.
                  
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "ramkrishna.s.vasudevan (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-5120:
------------------------------------------

    Attachment: HBASE-5120.patch

Attaching the patch so that i can access it at home.  First cut versiion and not fully tested.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>         Attachments: HBASE-5120.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "ramkrishna.s.vasudevan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179278#comment-13179278 ] 

ramkrishna.s.vasudevan commented on HBASE-5120:
-----------------------------------------------

I will take a look at this issue.  
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179662#comment-13179662 ] 

Zhihong Yu commented on HBASE-5120:
-----------------------------------

Thanks for the quick turnaround.

The patch depends on NullPointerException to detect problematic region.
I think we should use more protective measure as Lars commented.
The patch specifies M_ZK_REGION_CLOSING for ZKAssign.deleteNode(). This narrows the applicability of the NPE handling.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>         Attachments: HBASE-5120.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182968#comment-13182968 ] 

stack commented on HBASE-5120:
------------------------------

Convention is to flip this so server == null rather than null == server:

{code}
+    if (null == server) {
{code}

Ditto further down in the code.

Regards server == null, how would this arise?  I'm not clear on how we'd have a null server and still have a region in this.regions.  If there is a legit circumstance, then this patch I can see going in -- though it does not address the core issue of our not working properly when concurrent actors, in this case the TM and a disable (I still think TM should be turned way down or off altogether).
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183989#comment-13183989 ] 

Hadoop QA commented on HBASE-5120:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12510165/HBASE-5120_4.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    -1 javadoc.  The javadoc tool appears to have generated -147 warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 79 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

     -1 core tests.  The patch failed these unit tests:
                       org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat
                  org.apache.hadoop.hbase.io.hfile.TestLruBlockCache
                  org.apache.hadoop.hbase.mapred.TestTableMapReduce
                  org.apache.hadoop.hbase.mapreduce.TestImportTsv

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/726//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/726//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/726//console

This message is automatically generated.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "ramkrishna.s.vasudevan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195417#comment-13195417 ] 

ramkrishna.s.vasudevan commented on HBASE-5120:
-----------------------------------------------

@Stack
Ok.  Thats sound good.  But currently our clusters are running with 0.90.x.
May be it may take some more time for me to bring out the results.  By the time if some one volunteers to check this, I would be thankful to them.
Any way i would also try to check this out. Thanks Stack.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch, HBASE-5120_5.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "Zhihong Yu (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5120:
------------------------------

    Fix Version/s:     (was: 0.92.0)
                   0.92.1
    
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "ramkrishna.s.vasudevan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185502#comment-13185502 ] 

ramkrishna.s.vasudevan commented on HBASE-5120:
-----------------------------------------------

@J-D and @Lars
Latest patch uploaded. Sorry for uploading with the same name.
Hope this patch addresses your concern.?
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch, HBASE-5120_5.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-5120:
-------------------------

      Resolution: Fixed
    Hadoop Flags: Reviewed
          Status: Resolved  (was: Patch Available)

Closed.  Reopen if I have it wrong Ram.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.94.0
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch, HBASE-5120_5.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "ramkrishna.s.vasudevan (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-5120:
------------------------------------------

    Attachment:     (was: HBASE-5120.patch)
    
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>         Attachments: HBASE-5120.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "ramkrishna.s.vasudevan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180231#comment-13180231 ] 

ramkrishna.s.vasudevan commented on HBASE-5120:
-----------------------------------------------

@Stack
Yes if the server in this.regions is null we should not go with unassign at all.  
But there is one more condition where the RIT was getting a null server destination.
That is what i wanted to address.  This patch is not final as i wanted to test some conditions with this. 

About rewrite AM is also what we thought after TimeoutMonitor refactor as adding testcase was difficult.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: HBASE-5120.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "ramkrishna.s.vasudevan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194962#comment-13194962 ] 

ramkrishna.s.vasudevan commented on HBASE-5120:
-----------------------------------------------

@Stack
Can we commit this now to trunk?
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch, HBASE-5120_5.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "ramkrishna.s.vasudevan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187081#comment-13187081 ] 

ramkrishna.s.vasudevan commented on HBASE-5120:
-----------------------------------------------

Just was browsing thro HBASE-4015 where Timeout monitor was refactored.
With 5 secs as timeout period it was tested by balancing, killing and bringing up RS. Things came out fine.
But this disable scenario was missed out.
Another change that i could see is when HBASE-4015 was done for forceful unassign() we check if the node is present in CLOSING state then we did not proceed with it.
Now in recent code the check is removed.  May be that exposed the problem.
Thanks to JD for pointing this out. As per JD if after reducing the timeout period if we don't run to such type of issues then we can say TM is really fixed.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch, HBASE-5120_5.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179938#comment-13179938 ] 

stack commented on HBASE-5120:
------------------------------

When would it ever make sense that TimeoutMonitor ask a regionserver RE-close a region?  It implies that somehow the regionserver 'forgot' to do the close or that it dropped the message.  If either happened, we have bigger problems and having the TM do a new close will only make it worse (If a RS 'forgot' or 'failed' a close, its probably on its way out and its ephemeral node is about to vanish and master shutdown handler will do the close fixups; if it does not work this way, we should make it so?).

TM might make some sense for region openings but even here, I'd think the above applies; i.e. if a RS can't open a region, it should kill itself. If the region is bad, its unlikely any RS will open it unless the error transient.  For this latter case the TM running every 30 minutes, while conservative, is probably fine for a default.

I think we should leave the TM at its current 30 minute timeout and undo this issue as critical against 0.92.

There are outstanding race conditions in hbase around shutdown handler and master operations that deserve more attention that this IMO.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: HBASE-5120.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "Lars Hofhansl (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179660#comment-13179660 ] 

Lars Hofhansl commented on HBASE-5120:
--------------------------------------

Hey Ram, I know you said this is not done, yet...
{code}
if (t instanceof NullPointerException) {
  removeRegionInTransition(region);
...
{code}

Could we instead add a null check at the relevant point and deal with it there (or maybe throw another exception)? (Dealing with this after the NPE occurred leaves a bad taste... What we introduce another bug later that also causes an NPE, that will go unnoticed for a while.)

                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>         Attachments: HBASE-5120.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "ramkrishna.s.vasudevan (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-5120:
------------------------------------------

    Attachment: HBASE-5120_2.patch
    
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184251#comment-13184251 ] 

Zhihong Yu commented on HBASE-5120:
-----------------------------------

+1 on patch v5.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "ramkrishna.s.vasudevan (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-5120:
------------------------------------------

    Status: Open  (was: Patch Available)
    
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179940#comment-13179940 ] 

stack commented on HBASE-5120:
------------------------------

On the patch, Ram, should we even try unassign if server in this.regions is null; i.e. we should fail way earlier?

IMO, this AM needs a rewrite if only to make its various transitions testable.  There are also too many methods doing similar things with lots of overlap; its really hard to figure whats going on.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: HBASE-5120.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "ramkrishna.s.vasudevan (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-5120:
------------------------------------------

    Attachment: HBASE-5120_4.patch
    
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179265#comment-13179265 ] 

Zhihong Yu commented on HBASE-5120:
-----------------------------------

I followed J-D's suggestion of keeping the original description for HBASE-5119.
The plan is once this bug is fixed, HBASE-5119 would lower default timeout.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210822#comment-13210822 ] 

Hudson commented on HBASE-5120:
-------------------------------

Integrated in HBase-TRUNK-security #114 (See [https://builds.apache.org/job/HBase-TRUNK-security/114/])
    HBASE-5120 Timeout monitor races with table disable handler (Revision 1245731)

     Result = FAILURE
stack : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java

                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.94.0
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch, HBASE-5120_5.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183319#comment-13183319 ] 

Zhihong Yu commented on HBASE-5120:
-----------------------------------

I think deleteClosingOrClosedNode() should catch KeeperException and abort, considering that code appears twice in the patch.

There is no checking in deleteClosingOrClosedNode() on the return value for the second call to ZKAssign.deleteNode(). At least we should log the return value.
There might be more corner cases that we discover in the future.

Good job.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "Lars Hofhansl (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185140#comment-13185140 ] 

Lars Hofhansl commented on HBASE-5120:
--------------------------------------

+1 on v5

Nit: this is a bit weird
{code}
+      if (!ZKAssign.deleteNode(master.getZooKeeper(), region.getEncodedName(),
+          EventHandler.EventType.M_ZK_REGION_CLOSING)) {
+        boolean deleteNode = ZKAssign.deleteNode(master.getZooKeeper(), region
+            .getEncodedName(), EventHandler.EventType.RS_ZK_REGION_CLOSED);
+        // TODO : We don't abort if the delete node returns false. Is there any
+        // such corner case?
+        LOG.error("The deletion of the CLOSED node for the region "
+            + region.getEncodedName() + " returned " + deleteNode);
{code}
why not
{code}
+      boolean deleteNode = ZKAssign.deleteNode(master.getZooKeeper(), region
+            .getEncodedName(), EventHandler.EventType.RS_ZK_REGION_CLOSED);
+      if (!deleteNode) {
+        // TODO : We don't abort if the delete node returns false. Is there any
+        // such corner case?
+        LOG.error("The deletion of the CLOSED node for the region "
+            + region.getEncodedName() + " returned " + deleteNode);
{code}
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "gaojinchao (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gaojinchao updated HBASE-5120:
------------------------------

    Attachment: HBASE-5120.patch

Patch is attached so that i can access it at home.  Not the final one and not fully tested in cluster.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>         Attachments: HBASE-5120.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "ramkrishna.s.vasudevan (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-5120:
------------------------------------------

    Comment: was deleted

(was: Patch is attached so that i can access it at home.  Not the final one and not fully tested in cluster.)
    
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>         Attachments: HBASE-5120.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "Lars Hofhansl (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217967#comment-13217967 ] 

Lars Hofhansl commented on HBASE-5120:
--------------------------------------

Can we close this issue?
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.94.0
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch, HBASE-5120_5.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "Zhihong Yu (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5120:
------------------------------

    Fix Version/s: 0.94.0
                   0.92.0
    
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: HBASE-5120.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "ramkrishna.s.vasudevan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183074#comment-13183074 ] 

ramkrishna.s.vasudevan commented on HBASE-5120:
-----------------------------------------------

@Ted
{code}
catch (KeeperException e) {

{code}
On getting KeeperException we need to abort but not on NoNodeException. I missed this in the recent patch. But i don't find the use of the return value in this case.  That is why it was not used. May be if we get a false we can only log the message. But ideally as per the analysis the node can be in CLOSING state or CLOSED state only.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-5120:
-------------------------

    Fix Version/s:     (was: 0.92.1)
    
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.94.0
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch, HBASE-5120_5.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "ramkrishna.s.vasudevan (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-5120:
------------------------------------------

    Status: Patch Available  (was: Open)
    
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183380#comment-13183380 ] 

Zhihong Yu commented on HBASE-5120:
-----------------------------------

{code}
+        LOG.debug("The deletion of the CLOSED node returned " + deleteNode);
{code}
I think the above should be at ERROR level. Also, please make the sentence syntactically correct.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "Lars Hofhansl (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179263#comment-13179263 ] 

Lars Hofhansl commented on HBASE-5120:
--------------------------------------

HBASE-5119 is almost identical.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Assigned] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "ramkrishna.s.vasudevan (Assigned) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan reassigned HBASE-5120:
---------------------------------------------

    Assignee: ramkrishna.s.vasudevan
    
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "ramkrishna.s.vasudevan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183337#comment-13183337 ] 

ramkrishna.s.vasudevan commented on HBASE-5120:
-----------------------------------------------

@Ted
Thanks for your review. I have updated the patch.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "Jean-Daniel Cryans (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185369#comment-13185369 ] 

Jean-Daniel Cryans commented on HBASE-5120:
-------------------------------------------

Testing the patch with a low timeout, I can answer the question in the code that asks "We don't abort if the delete node returns false. Is there any such corner case?" and yes here it is:

{noformat}
2012-01-13 00:53:39,053 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. state=PENDING_CLOSE, ts=1326415997208, server=null
2012-01-13 00:53:39,053 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a.
2012-01-13 00:53:39,053 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. (offlining)
2012-01-13 00:53:39,254 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempting to unassign region TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. which is already PENDING_CLOSE but forcing to send a CLOSE RPC again 
2012-01-13 00:53:39,255 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: update TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. state=PENDING_CLOSE, ts=1326416019254, server=null the timestamp.
2012-01-13 00:53:39,256 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server sv4r12s38,62023,1326415651391 returned org.apache.hadoop.hbase.regionserver.RegionAlreadyInTransitionException: org.apache.hadoop.hbase.regionserver.RegionAlreadyInTransitionException: Received:CLOSE for the region:TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. ,which we are already trying to CLOSE. for 0784c045e00205949461cb21b8f4cd6a
2012-01-13 00:54:09,051 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. state=PENDING_CLOSE, ts=1326416019256, server=null
2012-01-13 00:54:09,051 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a.
2012-01-13 00:54:09,051 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. (offlining)
2012-01-13 00:54:09,126 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempting to unassign region TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. which is already PENDING_CLOSE but forcing to send a CLOSE RPC again 
2012-01-13 00:54:09,127 INFO org.apache.hadoop.hbase.master.AssignmentManager: While trying to recover the table TestTable to DISABLED state the region {NAME => 'TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a.', STARTKEY => '0006605550', ENDKEY => '0006616035', ENCODED => 0784c045e00205949461cb21b8f4cd6a,} was offlined but the table was in DISABLING state
2012-01-13 00:54:09,127 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db06d62 Deleting existing unassigned node for 0784c045e00205949461cb21b8f4cd6a that is in expected state M_ZK_REGION_CLOSING
2012-01-13 00:54:09,128 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db06d62 Attempting to delete unassigned node 0784c045e00205949461cb21b8f4cd6a in M_ZK_REGION_CLOSING state but node is in RS_ZK_REGION_CLOSED state
2012-01-13 00:54:09,128 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db06d62 Deleting existing unassigned node for 0784c045e00205949461cb21b8f4cd6a that is in expected state RS_ZK_REGION_CLOSED
2012-01-13 00:54:09,140 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r12s38,62023,1326415651391, region=0784c045e00205949461cb21b8f4cd6a
2012-01-13 00:54:09,140 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 0784c045e00205949461cb21b8f4cd6a from server sv4r12s38,62023,1326415651391 but region was in  the state null and not in expected PENDING_CLOSE or CLOSING states
2012-01-13 00:54:09,148 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db06d62 Successfully deleted unassigned node for region 0784c045e00205949461cb21b8f4cd6a in expected state RS_ZK_REGION_CLOSED
2012-01-13 00:54:09,148 ERROR org.apache.hadoop.hbase.master.AssignmentManager: The deletion of the CLOSED node for the region 0784c045e00205949461cb21b8f4cd6a returned true
2012-01-13 00:54:09,148 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server sv4r12s38,62023,1326415651391 returned org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Received close for TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. but we are not serving it for 0784c045e00205949461cb21b8f4cd6a
{noformat}

I turned out ok even if I had 7 regions that did that.

I also got the "CLOSING/CLOSED node for the region x already deleted" message, here's the context:

{noformat}
2012-01-13 00:53:39,144 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. state=PENDING_CLOSE, ts=1326415995575, server=null
2012-01-13 00:53:39,144 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48.
2012-01-13 00:53:39,145 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. (offlining)
2012-01-13 00:53:39,152 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempting to unassign region TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. which is already PENDING_CLOSE but forcing to send a CLOSE RPC again 
2012-01-13 00:53:39,169 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: update TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. state=PENDING_CLOSE, ts=1326416019152, server=null the timestamp.
2012-01-13 00:53:39,172 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server sv4r27s44,62023,1326415651133 returned org.apache.hadoop.hbase.regionserver.RegionAlreadyInTransitionException: org.apache.hadoop.hbase.regionserver.RegionAlreadyInTransitionException: Received:CLOSE for the region:TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. ,which we are already trying to CLOSE. for f6c862f9c16b15b9227c2ed46865fb48
2012-01-13 00:54:09,070 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. state=PENDING_CLOSE, ts=1326416019172, server=null
2012-01-13 00:54:09,070 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48.
2012-01-13 00:54:09,070 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. (offlining)
2012-01-13 00:54:09,076 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempting to unassign region TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. which is already PENDING_CLOSE but forcing to send a CLOSE RPC again 
2012-01-13 00:54:09,091 INFO org.apache.hadoop.hbase.master.AssignmentManager: While trying to recover the table TestTable to DISABLED state the region {NAME => 'TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48.', STARTKEY => '0004613400', ENDKEY => '0004623885', ENCODED => f6c862f9c16b15b9227c2ed46865fb48,} was offlined but the table was in DISABLING state
2012-01-13 00:54:09,106 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r27s44,62023,1326415651133, region=f6c862f9c16b15b9227c2ed46865fb48
2012-01-13 00:54:09,106 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for f6c862f9c16b15b9227c2ed46865fb48
2012-01-13 00:54:09,106 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48.
2012-01-13 00:54:09,106 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db06d62 Deleting existing unassigned node for f6c862f9c16b15b9227c2ed46865fb48 that is in expected state RS_ZK_REGION_CLOSED
2012-01-13 00:54:09,115 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db06d62 Successfully deleted unassigned node for region f6c862f9c16b15b9227c2ed46865fb48 in expected state RS_ZK_REGION_CLOSED
2012-01-13 00:54:09,127 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db06d62 Deleting existing unassigned node for f6c862f9c16b15b9227c2ed46865fb48 that is in expected state M_ZK_REGION_CLOSING
2012-01-13 00:54:09,128 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: CLOSING/CLOSED node for the region f6c862f9c16b15b9227c2ed46865fb48 already deleted
2012-01-13 00:54:09,128 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server sv4r27s44,62023,1326415651133 returned org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Received close for TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. but we are not serving it for f6c862f9c16b15b9227c2ed46865fb48
{noformat}

Again it turned out fine.

I'm +1 after fixing the comment about the corner case.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "ramkrishna.s.vasudevan (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-5120:
------------------------------------------

    Attachment: HBASE-5120_5.patch

Changed debug to error.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "ramkrishna.s.vasudevan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184203#comment-13184203 ] 

ramkrishna.s.vasudevan commented on HBASE-5120:
-----------------------------------------------

Latest patch available.. 
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210491#comment-13210491 ] 

stack commented on HBASE-5120:
------------------------------

I committed to trunk.  Will not commit to 0.92.  Not important enough of a bug I'd say.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch, HBASE-5120_5.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "ramkrishna.s.vasudevan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185424#comment-13185424 ] 

ramkrishna.s.vasudevan commented on HBASE-5120:
-----------------------------------------------

@J-D
{code}
boolean deleteNode = ZKAssign.deleteNode(master.getZooKeeper(), region
            .getEncodedName(), EventHandler.EventType.RS_ZK_REGION_CLOSED);
        // TODO : We don't abort if the delete node returns false. Is there any
        // such corner case?
        LOG.error("The deletion of the CLOSED node for the region "
            + region.getEncodedName() + " returned " + deleteNode);
{code}
This piece of code should have been 
{code}
        if (!deleteNode) {
          LOG.error("The deletion of the CLOSED node for the region "
              + region.getEncodedName() + " returned " + deleteNode);
        }
}
{code}

Because as per the first set of logs the node in the expected CLOSED state only.. That is why it is returning true and also deleting the node.  
If we add an 'if' check before the log it won't log an error.



                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211138#comment-13211138 ] 

Hudson commented on HBASE-5120:
-------------------------------

Integrated in HBase-TRUNK #2665 (See [https://builds.apache.org/job/HBase-TRUNK/2665/])
    HBASE-5120 Timeout monitor races with table disable handler (Revision 1245731)

     Result = FAILURE
stack : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java

                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.94.0
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch, HBASE-5120_5.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183080#comment-13183080 ] 

Zhihong Yu commented on HBASE-5120:
-----------------------------------

{code}
+        ZKAssign.deleteNode(master.getZooKeeper(), region.getEncodedName(),
+            EventHandler.EventType.RS_ZK_REGION_CLOSED);
{code}
If the above call returns false (without exception), I think we should abort.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183214#comment-13183214 ] 

Hadoop QA commented on HBASE-5120:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12510034/HBASE-5120_2.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    -1 javadoc.  The javadoc tool appears to have generated -151 warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 80 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

     -1 core tests.  The patch failed these unit tests:
                       org.apache.hadoop.hbase.mapreduce.TestImportTsv
                  org.apache.hadoop.hbase.mapred.TestTableMapReduce
                  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/716//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/716//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/716//console

This message is automatically generated.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183355#comment-13183355 ] 

Hadoop QA commented on HBASE-5120:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12510054/HBASE-5120_3.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    -1 javadoc.  The javadoc tool appears to have generated -151 warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 80 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

     -1 core tests.  The patch failed these unit tests:
                       org.apache.hadoop.hbase.regionserver.wal.TestLogRolling
                  org.apache.hadoop.hbase.mapreduce.TestImportTsv
                  org.apache.hadoop.hbase.mapred.TestTableMapReduce
                  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/718//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/718//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/718//console

This message is automatically generated.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Issue Comment Edited] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "Zhihong Yu (Issue Comment Edited) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184205#comment-13184205 ] 

Zhihong Yu edited comment on HBASE-5120 at 1/11/12 5:17 PM:
------------------------------------------------------------

Can you change LOG.debug() to LOG.error() in deleteClosingOrClosedNode() ?
{code}
+        LOG.debug("The deletion of the CLOSED node for the region "
+            + region.getEncodedName() + " returned " + deleteNode);
{code}
                
      was (Author: zhihyu@ebaysf.com):
    Can you change LOG.debug() to LOG.error() in deleteClosingOrClosedNode() ?
                  
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184258#comment-13184258 ] 

Hadoop QA commented on HBASE-5120:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12510211/HBASE-5120_5.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    -1 javadoc.  The javadoc tool appears to have generated -147 warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 79 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

     -1 core tests.  The patch failed these unit tests:
                       org.apache.hadoop.hbase.mapreduce.TestImportTsv
                  org.apache.hadoop.hbase.mapred.TestTableMapReduce
                  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/731//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/731//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/731//console

This message is automatically generated.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "ramkrishna.s.vasudevan (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-5120:
------------------------------------------

    Status: Patch Available  (was: Open)
    
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

Posted by "ramkrishna.s.vasudevan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183073#comment-13183073 ] 

ramkrishna.s.vasudevan commented on HBASE-5120:
-----------------------------------------------

bq.Regards server == null, how would this arise?
This will come when the ClosedRegionHandler does 
{code}
assignmentManager.offlineDisabledRegion(regionInfo);
{code} 
inside which we do 
{code}
regionOffline(regionInfo);
{code}
{code}
public void setOffline(HRegionInfo regionInfo) {
    synchronized (this.regions) {
      ServerName sn = this.regions.remove(regionInfo);
{code}
This is where we get a null Servername.
Now coming to the case where the server is null but we have an entry in RIT. When the closedRegionHandler is not yet done, by the time TM kicks in and starts another unassign then there is a condition where the RIT is made null by the ClosedRegionHandler and by the time the forceful unassign again adds the RIT in pending_close
{code}
synchronized (regionsInTransition) {
      state = regionsInTransition.get(encodedName);
      if (state == null) {
.....
        state = new RegionState(region, RegionState.State.PENDING_CLOSE);
        regionsInTransition.put(encodedName, state);
{code}
One more condition is when balancer happens and a region about to get assigned by ClosedRegionHandler
{code}
    assignmentManager.setOffline(regionInfo);
    assignmentManager.removeClosedRegion(regionInfo);
    assignmentManager.assign(regionInfo, true);
{code}
In this case the this.regions can have a null server but as per the patch we dont remove from the RIT because we check for PENDING_CLOSE or CLOSING state. In case of balancer the state of RIT is CLOSED.

This patch addresses the scenario where TM and closedRegionHandler coming thro Disable flow and also balancer flow. Pls provide your comments.

                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy and caused more troubles than it fixed", on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
> 2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira