Posted to issues@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2012/11/07 20:14:13 UTC
[jira] [Commented] (HBASE-6088) Region splitting not happened for long time due to ZK exception while creating RS_ZK_SPLITTING node

    [ https://issues.apache.org/jira/browse/HBASE-6088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492605#comment-13492605 ] 

stack commented on HBASE-6088:
------------------------------

Deleting the zk node should pass the sequence id so we don't delete a znode we were not responsible for creating.
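
Assuming "sequence id" here means the znode's data version, the suggestion amounts to a compare-and-delete: remember the version you last wrote and pass it to the delete, so the delete fails if anyone has touched the node since. ZooKeeper's Java client exposes this as `ZooKeeper.delete(String path, int version)`, which throws `KeeperException.BadVersionException` on a mismatch and `KeeperException.NoNodeException` if the node is gone; a version of -1 matches any version. To keep the sketch runnable without a ZooKeeper server, it uses a hypothetical in-memory `FakeZk` class that mimics those rules and returns an enum instead of throwing. It is an illustration, not HBase or ZooKeeper code.

```java
import java.util.HashMap;
import java.util.Map;

// HYPOTHETICAL in-memory stand-in for ZooKeeper, so the sketch runs without a server.
// It mimics two real ZooKeeper rules: every setData bumps the node's version, and
// delete(path, version) only succeeds if the version matches (-1 matches any version).
class FakeZk {
    enum DeleteResult { DELETED, NO_NODE, BAD_VERSION }

    private static final class Node {
        byte[] data;
        int version; // starts at 0 on create, like Stat.getVersion()

        Node(byte[] data) {
            this.data = data;
        }
    }

    private final Map<String, Node> nodes = new HashMap<>();

    // Returns false if the node already exists (ZooKeeper would throw NodeExistsException).
    boolean create(String path, byte[] data) {
        if (nodes.containsKey(path)) {
            return false;
        }
        nodes.put(path, new Node(data));
        return true;
    }

    // Conditional update: returns the new version, or -1 if the node is missing or
    // expectedVersion does not match (ZooKeeper would throw instead).
    int setData(String path, byte[] data, int expectedVersion) {
        Node n = nodes.get(path);
        if (n == null || (expectedVersion != -1 && n.version != expectedVersion)) {
            return -1;
        }
        n.data = data;
        n.version++;
        return n.version;
    }

    // Compare-and-delete: removes the node only if nobody has changed it since we
    // last wrote it at expectedVersion.
    DeleteResult delete(String path, int expectedVersion) {
        Node n = nodes.get(path);
        if (n == null) {
            return DeleteResult.NO_NODE;
        }
        if (expectedVersion != -1 && n.version != expectedVersion) {
            return DeleteResult.BAD_VERSION;
        }
        nodes.remove(path);
        return DeleteResult.DELETED;
    }

    boolean exists(String path) {
        return nodes.containsKey(path);
    }
}
```

In this model, a region server would keep the version returned by its last successful transition (for example `int mine = zk.setData(path, data, 0)` yields 1) and later call `zk.delete(path, mine)`. If another actor has since moved the node to version 2, the delete reports `BAD_VERSION` and leaves the node alone. One caveat of versions alone: a node deleted and recreated by someone else restarts at version 0, so a matching version is strong evidence of ownership, not proof.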

This seems radical:

{code}
+    } catch (KeeperException.NoNodeException nn) {
+      if (abort) {
+        server.abort("Failed cleanup of " + hri.getRegionNameAsString(), nn);
+      }
{code}

If we don't find our splitting node, we abort?

Good test.
                
>  Region splitting not happened for long time due to ZK exception while creating RS_ZK_SPLITTING node
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-6088
>                 URL: https://issues.apache.org/jira/browse/HBASE-6088
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: Gopinathan A
>            Assignee: rajeshbabu
>             Fix For: 0.92.2, 0.94.1, 0.96.0
>
>         Attachments: addendum_6088_94.patch, HBASE-6088_92.patch, HBASE-6088_94_2.patch, HBASE-6088_94_3.patch, HBASE-6088_94.patch, HBASE-6088_trunk_2.patch, HBASE-6088_trunk_3.patch, HBASE-6088_trunk_4.patch, HBASE-6088_trunk.patch
>
>
> Region splitting did not happen for a long time due to a ZK exception while creating the RS_ZK_SPLITTING node
> {noformat}
> 2012-05-24 01:45:41,363 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 26668ms for sessionid 0x1377a75f41d0012, closing socket connection and attempting reconnect
> 2012-05-24 01:45:41,464 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/unassigned/bd1079bf948c672e493432020dc0e144
> {noformat}
> {noformat}
> 2012-05-24 01:45:43,300 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: cleanupCurrentWriter  waiting for transactions to get synced  total 189377 synced till here 189365
> 2012-05-24 01:45:48,474 INFO org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup of failed split of ufdr,011365398471659,1337823505339.bd1079bf948c672e493432020dc0e144.; Failed setting SPLITTING znode on ufdr,011365398471659,1337823505339.bd1079bf948c672e493432020dc0e144.
> java.io.IOException: Failed setting SPLITTING znode on ufdr,011365398471659,1337823505339.bd1079bf948c672e493432020dc0e144.
> 	at org.apache.hadoop.hbase.regionserver.SplitTransaction.createDaughters(SplitTransaction.java:242)
> 	at org.apache.hadoop.hbase.regionserver.SplitTransaction.execute(SplitTransaction.java:450)
> 	at org.apache.hadoop.hbase.regionserver.SplitRequest.run(SplitRequest.java:67)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> 	at java.lang.Thread.run(Thread.java:662)
> Caused by: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/unassigned/bd1079bf948c672e493432020dc0e144
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> 	at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1246)
> 	at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:321)
> 	at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:659)
> 	at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:811)
> 	at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:747)
> 	at org.apache.hadoop.hbase.regionserver.SplitTransaction.transitionNodeSplitting(SplitTransaction.java:919)
> 	at org.apache.hadoop.hbase.regionserver.SplitTransaction.createNodeSplitting(SplitTransaction.java:869)
> 	at org.apache.hadoop.hbase.regionserver.SplitTransaction.createDaughters(SplitTransaction.java:239)
> 	... 5 more
> 2012-05-24 01:45:48,476 INFO org.apache.hadoop.hbase.regionserver.SplitRequest: Successful rollback of failed split of ufdr,011365398471659,1337823505339.bd1079bf948c672e493432020dc0e144.
> {noformat}
> {noformat}
> 2012-05-24 01:47:28,141 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Node /hbase/unassigned/bd1079bf948c672e493432020dc0e144 already exists and this is not a retry
> 2012-05-24 01:47:28,142 INFO org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup of failed split of ufdr,011365398471659,1337823505339.bd1079bf948c672e493432020dc0e144.; Failed create of ephemeral /hbase/unassigned/bd1079bf948c672e493432020dc0e144
> java.io.IOException: Failed create of ephemeral /hbase/unassigned/bd1079bf948c672e493432020dc0e144
> 	at org.apache.hadoop.hbase.regionserver.SplitTransaction.createNodeSplitting(SplitTransaction.java:865)
> 	at org.apache.hadoop.hbase.regionserver.SplitTransaction.createDaughters(SplitTransaction.java:239)
> 	at org.apache.hadoop.hbase.regionserver.SplitTransaction.execute(SplitTransaction.java:450)
> 	at org.apache.hadoop.hbase.regionserver.SplitRequest.run(SplitRequest.java:67)
> {noformat}
> Due to the above exception, region splitting failed continuously for more than 5 hours.
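
Read in order, the logs show a ConnectionLoss on the region's znode, then a BadVersionException during the SPLITTING transition, then a rollback at 01:45:48 that reported success, and yet at 01:47:28 the znode still existed ("already exists and this is not a retry"). That suggests the rollback did not remove the znode, so every later split attempt failed at creation. The Java sketch below models only that control flow, assuming the stale znode is the sole obstacle; `SplitSim` and `attemptSplit` are hypothetical names, and the model contrasts a rollback that leaves the node with one that deletes it.

```java
import java.util.HashSet;
import java.util.Set;

// HYPOTHETICAL model of the failure loop suggested by the logs above; SplitSim and
// attemptSplit are illustrative names, not HBase classes or methods.
class SplitSim {
    // Znodes currently present in the (simulated) /hbase/unassigned directory.
    final Set<String> znodes = new HashSet<>();
    // Whether rollback removes the SPLITTING znode this attempt created.
    private final boolean rollbackDeletesNode;

    SplitSim(boolean rollbackDeletesNode) {
        this.rollbackDeletesNode = rollbackDeletesNode;
    }

    // Returns true if the split went through. failAfterCreate simulates a transient ZK
    // error (such as the BadVersionException) striking after the znode was created.
    boolean attemptSplit(String path, boolean failAfterCreate) {
        if (!znodes.add(path)) {
            // Mirrors "Node ... already exists and this is not a retry": the attempt fails
            // before creating anything, so it has nothing of its own to roll back.
            return false;
        }
        if (failAfterCreate) {
            // Rollback. If it leaves the znode behind, every later attempt hits the branch above.
            if (rollbackDeletesNode) {
                znodes.remove(path);
            }
            return false;
        }
        // What happens to the znode after a successful split is out of scope for this model.
        return true;
    }
}
```

With `new SplitSim(false)`, a single transient failure makes every later `attemptSplit` return false even after ZooKeeper has recovered, which matches hours of failed splits; with `new SplitSim(true)`, the next attempt succeeds. In a real cleanup, that delete would ideally be version-checked as the comment above suggests, so a rollback cannot remove a znode another actor now owns.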

--
This message is automatically generated by JIRA.