Posted to issues@hbase.apache.org by "stack (Created) (JIRA)" <ji...@apache.org> on 2011/11/26 00:09:39 UTC

[jira] [Created] (HBASE-4872) Balancer goes to move a region but its being split so we fail offlining because NodeExists and it crashes master

Balancer goes to move a region but its being split so we fail offlining because NodeExists and it crashes master
----------------------------------------------------------------------------------------------------------------

                 Key: HBASE-4872
                 URL: https://issues.apache.org/jira/browse/HBASE-4872
             Project: HBase
          Issue Type: Bug
    Affects Versions: 0.92.0
            Reporter: stack
            Priority: Critical
             Fix For: 0.92.0


So, offlining should deal with there being an existing znode in SPLITTING or SPLIT state.  Here is the log around the master crash.

{code}
// BALANCER RUNS
2011-11-24 07:41:18,686 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=TestTable,0936288217,1322120470779.f681f38f85f6aad594f24f90559da051., src=sv4r9s38,7003,1322110570576, dest=sv4r31s44,7003,1322110570645
2011-11-24 07:41:18,686 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region TestTable,0936288217,1322120470779.f681f38f85f6aad594f24f90559da051. (offlining)
2011-11-24 07:41:18,686 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:7001-0x133d3ee451d0000 Creating unassigned node for f681f38f85f6aad594f24f90559da051 in a CLOSING state 
2011-11-24 07:41:18,702 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Node /hbase/unassigned/f681f38f85f6aad594f24f90559da051 already exists and this is not a retry

2011-11-24 07:41:18,702 FATAL org.apache.hadoop.hbase.master.HMaster: Master server abort: loaded coprocessors are: []
2011-11-24 07:41:18,704 FATAL org.apache.hadoop.hbase.master.HMaster: Unexpected ZK exception creating node CLOSING
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /hbase/unassigned/f681f38f85f6aad594f24f90559da051
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:110)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.createNonSequential(RecoverableZooKeeper.java:459)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.create(RecoverableZooKeeper.java:441)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndWatch(ZKUtil.java:839)
        at org.apache.hadoop.hbase.zookeeper.ZKAssign.createNodeClosing(ZKAssign.java:568)
        at org.apache.hadoop.hbase.master.AssignmentManager.unassign(AssignmentManager.java:1766)
        at org.apache.hadoop.hbase.master.AssignmentManager.unassign(AssignmentManager.java:1732)
        at org.apache.hadoop.hbase.master.AssignmentManager.balance(AssignmentManager.java:2763)
        at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:871)
        at org.apache.hadoop.hbase.master.HMaster$1.chore(HMaster.java:735)
        at org.apache.hadoop.hbase.Chore.run(Chore.java:67)
        at java.lang.Thread.run(Thread.java:662)
2011-11-24 07:41:18,706 INFO org.apache.hadoop.hbase.master.HMaster: Aborting


// HERE THE SPLIT SHOWS UP ON MASTER


2011-11-24 07:41:19,310 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_SPLIT, server=sv4r9s38,7003,1322110570576, region=f681f38f85f6aad594f24f90559da051
2011-11-24 07:41:19,311 INFO org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT for region f681f38f85f6aad594f24f90559da051 from server sv4r9s38,7003,1322110570576 but region was not first in SPLITTING state; continuing
2011-11-24 07:41:19,311 DEBUG org.apache.hadoop.hbase.master.handler.SplitRegionHandler: Handling SPLIT event for f681f38f85f6aad594f24f90559da051; deleting node
2011-11-24 07:41:19,311 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:7001-0x133d3ee451d0000 Deleting existing unassigned node for f681f38f85f6aad594f24f90559da051 that is in expected state RS_ZK_REGION_SPLIT
2011-11-24 07:41:19,401 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:7001-0x133d3ee451d0000 Successfully deleted unassigned node for region f681f38f85f6aad594f24f90559da051 in expected state RS_ZK_REGION_SPLIT
2011-11-24 07:41:19,401 INFO org.apache.hadoop.hbase.master.handler.SplitRegionHandler: Handled SPLIT report); parent=TestTable,0936288217,1322120470779.f681f38f85f6aad594f24f90559da051. daughter a=TestTable,0936288217,1322120475927.98808642d25058c930fc9e5974f3715d.daughter b=TestTable,0936327449,1322120475927.00ba03057ddc6d3868fb6ef349cb9a2c.
{code}

Here is the regionserver side of things at the time of the split:

{code}
2011-11-24 07:41:15,927 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Split requested for TestTable,0936288217,1322120470779.f681f38f85f6aad594f24f90559da051..  compaction_queue=(1:0), split_queue=1
2011-11-24 07:41:15,928 INFO org.apache.hadoop.hbase.regionserver.SplitTransaction: Starting split of region TestTable,0936288217,1322120470779.f681f38f85f6aad594f24f90559da051.
2011-11-24 07:41:15,928 DEBUG org.apache.hadoop.hbase.regionserver.SplitTransaction: regionserver:7003-0x133d3ee451d000c Creating ephemeral node for f681f38f85f6aad594f24f90559da051 in SPLITTING state
2011-11-24 07:41:15,945 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:7003-0x133d3ee451d000c Attempting to transition node f681f38f85f6aad594f24f90559da051 from RS_ZK_REGION_SPLITTING to RS_ZK_REGION_SPLITTING
2011-11-24 07:41:15,953 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:7003-0x133d3ee451d000c Successfully transitioned node f681f38f85f6aad594f24f90559da051 from RS_ZK_REGION_SPLITTING to RS_ZK_REGION_SPLITTING

...

2011-11-24 07:41:19,401 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:7003-0x133d3ee451d000c Attempt to transition the unassigned node for f681f38f85f6aad594f24f90559da051 from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT failed, the node existed and was in the expected state but then when setting data it no longer existed



{code}
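
To make the request above concrete, here is a rough sketch of offlining that tolerates an existing znode instead of aborting. It is not HBase code: the class, the enums and the in-memory map standing in for /hbase/unassigned are all hypothetical, and the only thing borrowed from the logs is the state names. The idea is simply that a NodeExists result gets inspected, and a node in SPLITTING or SPLIT state means "skip this move" rather than "abort the master".

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical in-memory stand-in for /hbase/unassigned; not the HBase or ZooKeeper API.
class OffliningSketch {
    // State names follow the logs above; OPENED is only here as an "other" state.
    enum State { CLOSING, SPLITTING, SPLIT, OPENED }

    enum Outcome { CLOSING_NODE_CREATED, SKIPPED_REGION_SPLITTING, UNEXPECTED_EXISTING_NODE }

    private final ConcurrentMap<String, State> unassigned = new ConcurrentHashMap<>();

    // Simulates a regionserver creating or updating the znode for a region.
    void regionServerSets(String region, State state) {
        unassigned.put(region, state);
    }

    // Simulates the master deleting the znode once the SPLIT has been handled.
    void delete(String region) {
        unassigned.remove(region);
    }

    // Tries to create the CLOSING node; an existing node is inspected, not treated as fatal.
    Outcome tryOffline(String region) {
        // putIfAbsent is atomic, playing the role of a create that can fail with NodeExists.
        State existing = unassigned.putIfAbsent(region, State.CLOSING);
        if (existing == null) {
            return Outcome.CLOSING_NODE_CREATED;
        }
        if (existing == State.SPLITTING || existing == State.SPLIT) {
            // The region is being split: give up on this move and let the split finish.
            return Outcome.SKIPPED_REGION_SPLITTING;
        }
        // Any other existing node is still unexpected; the caller decides how to react.
        return Outcome.UNEXPECTED_EXISTING_NODE;
    }

    public static void main(String[] args) {
        OffliningSketch zk = new OffliningSketch();
        String region = "f681f38f85f6aad594f24f90559da051";
        zk.regionServerSets(region, State.SPLITTING);
        System.out.println(zk.tryOffline(region));
        zk.delete(region);
        System.out.println(zk.tryOffline(region));
    }
}
```

Whether the balancer should retry later or just drop the plan for that region is left open here.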

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Resolved] (HBASE-4872) Balancer goes to move a region but its being split so we fail offlining because NodeExists and it crashes master

Posted by "stack (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-4872.
--------------------------

    Resolution: Duplicate

Closing as duplicate of HBASE-4729 (Thanks Ted).

[jira] [Commented] (HBASE-4872) Balancer goes to move a region but its being split so we fail offlining because NodeExists and it crashes master

Posted by "Ted Yu (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157333#comment-13157333 ] 

Ted Yu commented on HBASE-4872:
-------------------------------

This seems similar to HBASE-4729.

Maybe we should introduce an interlock between region movement and region splitting.
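
As a very rough illustration of what such an interlock could look like, here is an in-process sketch. Everything in it is hypothetical (class name, method names, the Op enum), and it ignores the hard part: a real interlock would have to be visible to both the master and the regionserver, whereas this one lives in a single JVM. It only shows the rule that whichever operation claims a region first wins, and the other one backs off.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical in-process interlock; not an existing HBase class.
class RegionInterlockSketch {
    enum Op { MOVE, SPLIT }

    // Region name -> operation currently holding the region.
    private final ConcurrentMap<String, Op> inProgress = new ConcurrentHashMap<>();

    // Returns true if the caller now holds the region for this operation.
    boolean tryBegin(String region, Op op) {
        return inProgress.putIfAbsent(region, op) == null;
    }

    // Releases the region only if it is still held for the given operation.
    void end(String region, Op op) {
        inProgress.remove(region, op);
    }

    public static void main(String[] args) {
        RegionInterlockSketch interlock = new RegionInterlockSketch();
        String region = "f681f38f85f6aad594f24f90559da051";
        System.out.println("split may start: " + interlock.tryBegin(region, Op.SPLIT));
        System.out.println("move may start:  " + interlock.tryBegin(region, Op.MOVE));
        interlock.end(region, Op.SPLIT);
        System.out.println("move may start:  " + interlock.tryBegin(region, Op.MOVE));
    }
}
```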