Posted to issues@hbase.apache.org by "Chinna Rao Lalam (Commented) (JIRA)" <ji...@apache.org> on 2012/03/21 07:29:41 UTC

[jira] [Commented] (HBASE-5606) SplitLogManger async delete node hangs log splitting when ZK connection is lost

    [ https://issues.apache.org/jira/browse/HBASE-5606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234152#comment-13234152 ] 

Chinna Rao Lalam commented on HBASE-5606:
-----------------------------------------

This situation can arise in 0.92 when:

1) Initially, SplitLogManager installed one task; after that it was unable to connect to ZooKeeper (because of CONNECTIONLOSS).

So the GetDataAsyncCallback, which was registered at createNode() time in installTask() or by the TimeoutMonitor, fails and keeps retrying.

{noformat}
19:32:24,657 WARN org.apache.hadoop.hbase.master.SplitLogManager$GetDataAsyncCallback: getdata rc = CONNECTIONLOSS /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020.1331752316170 retry=0
{noformat}

2) Whenever a GetDataAsyncCallback runs out of retries (retry=0), it calls setDone(); there it increments batch.error and registers one DeleteAsyncCallback (see the sketch below).
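
To make points 1 and 2 concrete, here is a minimal sketch of that callback flow. It is a simplified, self-contained model; the class, method, and counter names only approximate the real 0.92 SplitLogManager code and are not copied from it.

{noformat}
// Simplified model of the GetDataAsyncCallback / setDone() flow (points 1-2).
// NOTE: names and signatures are approximations, not the actual HBase source.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class SplitLogFlowSketch {

    static class TaskBatch {
        int installed;   // tasks put up for this batch
        int done;        // tasks that finished successfully
        int error;       // tasks that ended in FAILURE
    }

    // path of the task znode -> batch it belongs to (simplified "tasks" map)
    final ConcurrentMap<String, TaskBatch> tasks =
        new ConcurrentHashMap<String, TaskBatch>();

    // Called by the async getData callback when ZK returns CONNECTIONLOSS.
    void onGetDataFailed(String path, long retriesLeft) {
        if (retriesLeft == 0) {
            setDone(path, false);                // out of retries: report FAILURE
            return;
        }
        getDataSetWatch(path, retriesLeft - 1);  // otherwise retry with one less
    }

    void setDone(String path, boolean success) {
        TaskBatch batch = tasks.get(path);
        if (batch != null) {
            synchronized (batch) {
                if (success) { batch.done++; } else { batch.error++; }
                batch.notify();                  // wakes up waitTasks()
            }
        }
        // Every setDone() also queues an async delete of the task znode; this
        // is the DeleteAsyncCallback that later races with the re-installed task.
        deleteNode(path, Long.MAX_VALUE);
    }

    void getDataSetWatch(String path, long retries) { /* async ZK getData */ }
    void deleteNode(String path, long retries)      { /* async ZK delete  */ }
}
{noformat}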

3) Now installed != done, so SplitLogManager throws an exception and the split is submitted again (sketched below).
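
A sketch of the check behind point 3, reusing the TaskBatch model above (again an approximation of splitLogDistributed(), not the exact source):

{noformat}
// After waitTasks() returns, if any task ended in error then done < installed,
// so splitLogDistributed() throws and the caller (ServerShutdownHandler path)
// resubmits the whole split. Approximate logic only; waitTasks() is sketched
// under point 9.
void splitLogDistributedSketch(TaskBatch batch) throws java.io.IOException {
    waitTasks(batch);   // blocks until done + error == installed
    if (batch.done != batch.installed) {
        throw new java.io.IOException("error while splitting logs: installed="
            + batch.installed + " but only done=" + batch.done);
    }
}
{noformat}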

4)"failed to set data watch" is happened 92 times so 92 DeleteAsyncCallback are registered and all 92 DeleteAsyncCallback will try till it success.

{noformat}
19:34:30,874 WARN org.apache.hadoop.hbase.master.SplitLogManager: failed to set data watch /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
{noformat}
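
The "retry until it succeeds" behaviour from point 4 roughly looks like this (approximate sketch of the delete callback, not the actual DeleteAsyncCallback code):

{noformat}
// Approximate sketch of the delete callback from point 4. On CONNECTIONLOSS it
// simply re-issues the delete, so all 92 registered callbacks stay pending
// until ZooKeeper is reachable again, however long that takes.
void onDeleteResult(int rc, String path, long retriesLeft) {
    if (rc == 0) {
        return;                               // deleted -- this is point 6
    }
    if (retriesLeft > 0) {
        deleteNode(path, retriesLeft - 1);    // retriesLeft started at Long.MAX_VALUE
    }
}
{noformat}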

5) Because of point 3, SplitLogManager tries to install the task again, but it finds the already-installed task in FAILURE state, so it waits for it to change to DELETED (sketched below).
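
Point 5 is the wait on the old FAILURE task; a rough model of that wait, continuing the sketch class above (field and status names are illustrative, not the exact 0.92 code):

{noformat}
// Approximate model of point 5: the re-install finds an existing task in
// FAILURE state and waits for the async delete to flip it to DELETED before
// it recreates the znode (point 7). Names are illustrative only.
enum TaskStatus { IN_PROGRESS, SUCCESS, FAILURE, DELETED }

static class Task {
    TaskStatus status = TaskStatus.FAILURE;   // guarded by synchronized (task)
}

void reinstallAfterFailure(String path, Task oldTask) throws InterruptedException {
    synchronized (oldTask) {
        while (oldTask.status == TaskStatus.FAILURE) {
            oldTask.wait();                   // notified by the delete callback (point 6)
        }
    }
    createNode(path, 3 /* zkretries */);      // point 7: put the splitlog znode back up
}

void createNode(String path, long retries) { /* async ZK create */ }
{noformat}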

6) Once the ZooKeeper connection is back, one of the DeleteAsyncCallbacks deletes the node and notifies the task that is waiting at point 5.

{noformat}
19:34:31,196 DEBUG org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback: deleted /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
{noformat}

7) After being notified, the waiter from point 5 creates the node.

{noformat}
19:34:32,387 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task at znode /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
{noformat}

8) But the already-registered DeleteAsyncCallbacks still execute, and one of them deletes the node newly created at point 7.

{noformat}
19:34:32,497 DEBUG org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback: deleted /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
{noformat}

9) Because the node is deleted and the task has been removed from the tasks map, the flow never reaches the code in setDone() that increments batch.done or batch.error.
So waitTasks() loops forever and never comes out (see the sketch after the stack trace below).

{noformat}
"MASTER_META_SERVER_OPERATIONS-HOST-192-168-47-204,60000,1331719909985-1" prio=10 tid=0x0000000040d7c000 nid=0x624b in Object.wait() [0x00007ff090482000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	at org.apache.hadoop.hbase.master.SplitLogManager.waitTasks(SplitLogManager.java:316)
	- locked <0x000000078e6c4258> (a org.apache.hadoop.hbase.master.SplitLogManager$TaskBatch)
	at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:262)
{noformat}
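
For completeness, here is a sketch of the waitTasks() loop that the thread above is stuck in (approximate, based on the TaskBatch model sketched earlier; compare SplitLogManager.waitTasks in 0.92):

{noformat}
// Approximate sketch of waitTasks() (point 9). The loop only exits once every
// installed task has reported done or error through setDone(). Because the
// stale delete removed the new task before it could report, the sum never
// reaches batch.installed and this thread keeps waiting forever.
void waitTasks(TaskBatch batch) {
    synchronized (batch) {
        while ((batch.done + batch.error) != batch.installed) {
            try {
                batch.wait(100);              // matches the TIMED_WAITING state above
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}
{noformat}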
                
> SplitLogManger async delete node hangs log splitting when ZK connection is lost 
> --------------------------------------------------------------------------------
>
>                 Key: HBASE-5606
>                 URL: https://issues.apache.org/jira/browse/HBASE-5606
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 0.92.0
>            Reporter: Gopinathan A
>            Priority: Critical
>             Fix For: 0.92.2
>
>
> 1. One RS died; the ServerShutdownHandler found it and started distributed log splitting;
> 2. All tasks failed due to loss of the ZK connection, so all the tasks were deleted asynchronously;
> 3. ServerShutdownHandler retried the log splitting;
> 4. The asynchronous deletion from step 2 finally happened for the new task;
> 5. This left the SplitLogManager in a hanging state.
> This leads to the .META. region not being assigned for a long time.
> {noformat}
> hbase-root-master-HOST-192-168-47-204.log.2012-03-14"(55413,79):2012-03-14 19:28:47,932 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task at znode /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
> hbase-root-master-HOST-192-168-47-204.log.2012-03-14"(89303,79):2012-03-14 19:34:32,387 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task at znode /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
> {noformat}
> {noformat}
> hbase-root-master-HOST-192-168-47-204.log.2012-03-14"(80417,99):2012-03-14 19:34:31,196 DEBUG org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback: deleted /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
> hbase-root-master-HOST-192-168-47-204.log.2012-03-14"(89456,99):2012-03-14 19:34:32,497 DEBUG org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback: deleted /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
> {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira