You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@hbase.apache.org by "Jimmy Xiang (Created) (JIRA)" <ji...@apache.org> on 2011/12/28 00:24:30 UTC

[jira] [Created] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Key: HBASE-5099
URL: https://issues.apache.org/jira/browse/HBASE-5099
Project: HBase
Issue Type: Bug
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang

A RS died. The ServerShutdownHandler kicked in and started the logspliting. SpliLogManager
installed the tasks asynchronously, then started to wait for them to complete.

The task znodes were not created actually. The requests were just queued.
At this time, the zookeeper connection expired. HMaster tried to recover the expired ZK session.
During the recovery, a new zookeeper connection was created. However, this master became the
new master again. It tried to assign root and meta.

Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
This waiting holds the zookeeper event thread. So the async create split task is never retried since
there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Assigned] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Jimmy Xiang (Assigned) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang reassigned HBASE-5099:
----------------------------------

    Assignee: Jimmy Xiang
    
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176826#comment-13176826 ] 

Hadoop QA commented on HBASE-5099:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12508783/hbase-5099.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    -1 javadoc.  The javadoc tool appears to have generated -151 warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

     -1 core tests.  The patch failed these unit tests:
                       org.apache.hadoop.hbase.mapred.TestTableMapReduce
                  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/613//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/613//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/613//console

This message is automatically generated.
                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178010#comment-13178010 ] 

Hudson commented on HBASE-5099:
-------------------------------

Integrated in HBase-TRUNK-security #56 (See [https://builds.apache.org/job/HBase-TRUNK-security/56/])
    HBASE-5099  ZK event thread waiting for root region assignment may block server
               shutdown handler for the region sever the root region was on (Jimmy)

tedyu : 
Files : 
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java

                
> ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on

Posted by "Jimmy Xiang (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177860#comment-13177860 ] 

Jimmy Xiang commented on HBASE-5099:
------------------------------------

I tried to debug this testcase but it doesn't stop at the changes I did.
                
> ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176859#comment-13176859 ] 

jiraposter@reviews.apache.org commented on HBASE-5099:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3323/
-----------------------------------------------------------

(Updated 2011-12-28 22:38:05.839771)


Review request for hbase, Ted Yu and Michael Stack.


Changes
-------

Thanks for the comment.  Here is a new patch.


Summary
-------

Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.

I added a test to test the abort method.  However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted.  So the actually success recovery is not tested.


This addresses bug HBASE-5099.
    https://issues.apache.org/jira/browse/HBASE-5099


Diffs (updated)
-----

  src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 
  src/test/java/org/apache/hadoop/hbase/master/TestMaster.java b7a8270 

Diff: https://reviews.apache.org/r/3323/diff


Testing
-------

mvn -PlocalTests -Dtest=TestMaster* clean test


Thanks,

Jimmy


                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177438#comment-13177438 ] 

jiraposter@reviews.apache.org commented on HBASE-5099:
------------------------------------------------------

bq.  On 2011-12-29 22:29:34, Lars Hofhansl wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/master/HMaster.java, line 1464
bq.  > <https://reviews.apache.org/r/3323/diff/5/?file=65971#file65971line1464>
bq.  >
bq.  >     Should we issue executor.shutdownNow() here if the worker did not finish, in order to attempt to interrupt and long running worker?

Yes, that will be safer.

- Jimmy

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3323/#review4154
-----------------------------------------------------------

On 2011-12-29 21:16:22, Jimmy Xiang wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/3323/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-12-29 21:16:22)
bq.  
bq.  
bq.  Review request for hbase, Ted Yu and Michael Stack.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.
bq.  
bq.  I added a test to test the abort method.  However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted.  So the actually success recovery is not tested.
bq.  
bq.  
bq.  This addresses bug HBASE-5099.
bq.      https://issues.apache.org/jira/browse/HBASE-5099
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 
bq.    src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java PRE-CREATION 
bq.  
bq.  Diff: https://reviews.apache.org/r/3323/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  mvn -PlocalTests -Dtest=TestMaster* clean test
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Jimmy
bq.  
bq.

> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177373#comment-13177373 ] 

Zhihong Yu commented on HBASE-5099:
-----------------------------------

bq. How about just throw ExecutionException? 
That is fine. tryRecoveringExpiredZKSession() is private.
                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Jimmy Xiang (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HBASE-5099:
-------------------------------

    Attachment: hbase-5099-v3.patch
    
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177363#comment-13177363 ] 

Zhihong Yu commented on HBASE-5099:
-----------------------------------

TestMasterZKSessionRecovery is a good name for the new test.
                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176798#comment-13176798 ] 

jiraposter@reviews.apache.org commented on HBASE-5099:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3323/#review4136
-----------------------------------------------------------


Nice effort.


src/main/java/org/apache/hadoop/hbase/master/HMaster.java
<https://reviews.apache.org/r/3323/#comment9339>

    We should check the return value here.
    False return value means timeout.



src/main/java/org/apache/hadoop/hbase/master/HMaster.java
<https://reviews.apache.org/r/3323/#comment9340>

    I think we should catch ExecutionException (ee) coming out of the get() call and rethrow ee.getCause().



src/test/java/org/apache/hadoop/hbase/master/TestMaster.java
<https://reviews.apache.org/r/3323/#comment9343>

    TestMaster doesn't have timeout parameter.
    We'd better add timeout parameter here.



src/test/java/org/apache/hadoop/hbase/master/TestMaster.java
<https://reviews.apache.org/r/3323/#comment9341>

    Type: recovery.



src/test/java/org/apache/hadoop/hbase/master/TestMaster.java
<https://reviews.apache.org/r/3323/#comment9342>

    I would expect some assertion here.


- Ted


On 2011-12-28 19:31:45, Jimmy Xiang wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/3323/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-12-28 19:31:45)
bq.  
bq.  
bq.  Review request for hbase, Ted Yu and Michael Stack.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.
bq.  
bq.  I added a test to test the abort method.  However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted.  So the actually success recovery is not tested.
bq.  
bq.  
bq.  This addresses bug HBASE-5099.
bq.      https://issues.apache.org/jira/browse/HBASE-5099
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 
bq.    src/test/java/org/apache/hadoop/hbase/master/TestMaster.java b7a8270 
bq.  
bq.  Diff: https://reviews.apache.org/r/3323/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  mvn -PlocalTests -Dtest=TestMaster* clean test
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Jimmy
bq.  
bq.


                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177342#comment-13177342 ] 

Zhihong Yu commented on HBASE-5099:
-----------------------------------

bq. We shouldn't pass 1 here since that means 1 master.
The new test is different from TestMasterFailover so I guess 1 could be fine here.

How about naming the new test TestZKSessionRecoveryInMaster ?
                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Jimmy Xiang (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176430#comment-13176430 ] 

Jimmy Xiang commented on HBASE-5099:
------------------------------------

This is good.  If introducing a timeout, I prefer to do it for tryRecoveringExpiredZKSession().  
The reason for that is, other than waitForAssignment, there are several other places which have the waiting logic as well,
such as bulkAssign(), waitForRoot(), this.activeMasterManager.blockUntilBecomingActiveMaster(startupStatus), etc.
                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177991#comment-13177991 ] 

Hudson commented on HBASE-5099:
-------------------------------

Integrated in HBase-0.92 #221 (See [https://builds.apache.org/job/HBase-0.92/221/])
    HBASE-5099  ZK event thread waiting for root region assignment may block server
               shutdown handler for the region sever the root region was on (Jimmy)

tedyu : 
Files : 
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java

                
> ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177603#comment-13177603 ] 

Hudson commented on HBASE-5099:
-------------------------------

Integrated in HBase-TRUNK #2593 (See [https://builds.apache.org/job/HBase-TRUNK/2593/])
    HBASE-5099  ZK event thread waiting for root region assignment may block server
               shutdown handler for the region sever the root region was on (Jimmy)

tedyu : 
Files : 
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java

                
> ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Jimmy Xiang (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HBASE-5099:
-------------------------------

    Attachment: hbase-5099-v4.patch
    
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Zhihong Yu (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5099:
------------------------------

    Attachment:     (was: 5099.92)
    
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177351#comment-13177351 ] 

jiraposter@reviews.apache.org commented on HBASE-5099:
------------------------------------------------------

bq.  On 2011-12-29 19:02:17, Ted Yu wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/master/HMaster.java, line 1475
bq.  > <https://reviews.apache.org/r/3323/diff/3/?file=65929#file65929line1475>
bq.  >
bq.  >     Timeout value should be included in the exception message.

It may not because of timeout.

bq.  On 2011-12-29 19:02:17, Ted Yu wrote:
bq.  > src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java, line 2
bq.  > <https://reviews.apache.org/r/3323/diff/3/?file=65930#file65930line2>
bq.  >
bq.  >     Year is not needed.

I copied it from another test case.  I can remove it.

bq.  On 2011-12-29 19:02:17, Ted Yu wrote:
bq.  > src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java, line 34
bq.  > <https://reviews.apache.org/r/3323/diff/3/?file=65930#file65930line34>
bq.  >
bq.  >     This shouldn't be small test since mini cluster is involved.

Ok.

bq.  On 2011-12-29 19:02:17, Ted Yu wrote:
bq.  > src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java, line 41
bq.  > <https://reviews.apache.org/r/3323/diff/3/?file=65930#file65930line41>
bq.  >
bq.  >     We shouldn't pass 1 here since that means 1 master.

It means 1 region server.  Actually, I do want 1 master so it can recover itself.  Otherwise, the backup master will take over and the active master doesn't have a chance to recover in this scenario.

bq.  On 2011-12-29 19:02:17, Ted Yu wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/master/HMaster.java, line 1452
bq.  > <https://reviews.apache.org/r/3323/diff/3/?file=65929#file65929line1452>
bq.  >
bq.  >     Line is 88 chars long.
bq.  >     If ExecutorService is imported, the line should be much shorter.

We can't do this since there is already an ExecutorService (from hbase) imported.  I can't use the hbase ExecutorService because it doesn't fit.

bq.  On 2011-12-29 19:02:17, Ted Yu wrote:
bq.  > src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java, line 50
bq.  > <https://reviews.apache.org/r/3323/diff/3/?file=65930#file65930line50>
bq.  >
bq.  >     Since the test doesn't involve standby master, I think we should use a different name.

There is test case called TestMasterFailover which involves standby master.  That's why I called it TestMasterRecovery.  How about TestMasterZKSessionRecovery?

- Jimmy

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3323/#review4148
-----------------------------------------------------------

On 2011-12-29 18:38:54, Jimmy Xiang wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/3323/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-12-29 18:38:54)
bq.  
bq.  
bq.  Review request for hbase, Ted Yu and Michael Stack.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.
bq.  
bq.  I added a test to test the abort method.  However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted.  So the actually success recovery is not tested.
bq.  
bq.  
bq.  This addresses bug HBASE-5099.
bq.      https://issues.apache.org/jira/browse/HBASE-5099
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 
bq.    src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java PRE-CREATION 
bq.  
bq.  Diff: https://reviews.apache.org/r/3323/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  mvn -PlocalTests -Dtest=TestMaster* clean test
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Jimmy
bq.  
bq.

> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177386#comment-13177386 ] 

jiraposter@reviews.apache.org commented on HBASE-5099:
------------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3323/#review4152
-----------------------------------------------------------

I like this version.

Minor comments below.

src/main/java/org/apache/hadoop/hbase/master/HMaster.java
<https://reviews.apache.org/r/3323/#comment9377>

    Line should wrap after = sign.

src/main/java/org/apache/hadoop/hbase/master/HMaster.java
<https://reviews.apache.org/r/3323/#comment9376>

    The catch block is redundant.

src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java
<https://reviews.apache.org/r/3323/#comment9378>

    Javadoc for test class.

src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java
<https://reviews.apache.org/r/3323/#comment9379>

    How about giving "hbase.master.zksession.recover.timeout" a much small value for the mini cluster so that the timeout value on line 73 can be lowered ?

- Ted

On 2011-12-29 20:41:26, Jimmy Xiang wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/3323/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-12-29 20:41:26)
bq.  
bq.  
bq.  Review request for hbase, Ted Yu and Michael Stack.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.
bq.  
bq.  I added a test to test the abort method.  However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted.  So the actually success recovery is not tested.
bq.  
bq.  
bq.  This addresses bug HBASE-5099.
bq.      https://issues.apache.org/jira/browse/HBASE-5099
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 
bq.    src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java PRE-CREATION 
bq.  
bq.  Diff: https://reviews.apache.org/r/3323/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  mvn -PlocalTests -Dtest=TestMaster* clean test
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Jimmy
bq.  
bq.

> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Lars Hofhansl (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177449#comment-13177449 ] 

Lars Hofhansl commented on HBASE-5099:
--------------------------------------

Yep... That's better.
                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Jimmy Xiang (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HBASE-5099:
-------------------------------

    Status: Open  (was: Patch Available)
    
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177877#comment-13177877 ] 

Hudson commented on HBASE-5099:
-------------------------------

Integrated in HBase-0.92 #219 (See [https://builds.apache.org/job/HBase-0.92/219/])
    HBASE-5099 revert due to continuous 0.92 build failures

tedyu : 
Files : 
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java

                
> ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177311#comment-13177311 ] 

jiraposter@reviews.apache.org commented on HBASE-5099:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3323/
-----------------------------------------------------------

(Updated 2011-12-29 18:38:54.473524)


Review request for hbase, Ted Yu and Michael Stack.


Changes
-------

Updated per Ted's comment.  Also separated the test class and added a possible testcase.


Summary
-------

Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.

I added a test to test the abort method.  However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted.  So the actually success recovery is not tested.


This addresses bug HBASE-5099.
    https://issues.apache.org/jira/browse/HBASE-5099


Diffs (updated)
-----

  src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 
  src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java PRE-CREATION 

Diff: https://reviews.apache.org/r/3323/diff


Testing
-------

mvn -PlocalTests -Dtest=TestMaster* clean test


Thanks,

Jimmy


                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177572#comment-13177572 ] 

Zhihong Yu commented on HBASE-5099:
-----------------------------------

Integrated to 0.92 and TRUNK, respectively.

Thanks for the patch Jimmy.

Thanks for the suggestion, Lars.
                
> ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176436#comment-13176436 ] 

Zhihong Yu commented on HBASE-5099:
-----------------------------------

tryRecoveringExpiredZKSession() is private and only called by abortNow().
Are you suggesting adding timeout for all the methods mentioned above ?
                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Jimmy Xiang (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HBASE-5099:
-------------------------------

    Status: Open  (was: Patch Available)
    
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Jimmy Xiang (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HBASE-5099:
-------------------------------

    Attachment: ZK-event-thread-waiting-for-root.png
                distributed-log-splitting-hangs.png
    
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177376#comment-13177376 ] 

jiraposter@reviews.apache.org commented on HBASE-5099:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3323/
-----------------------------------------------------------

(Updated 2011-12-29 20:41:26.086617)


Review request for hbase, Ted Yu and Michael Stack.


Changes
-------

Renamed the test class.  tryRecoveringExpiredZKSession() throws ExecutionException now.


Summary
-------

Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.

I added a test to test the abort method.  However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted.  So the actually success recovery is not tested.


This addresses bug HBASE-5099.
    https://issues.apache.org/jira/browse/HBASE-5099


Diffs (updated)
-----

  src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 
  src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java PRE-CREATION 

Diff: https://reviews.apache.org/r/3323/diff


Testing
-------

mvn -PlocalTests -Dtest=TestMaster* clean test


Thanks,

Jimmy


                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Zhihong Yu (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5099:
------------------------------

    Attachment: 5099.92

Patch for 0.92 branch where all TestMasterXX tests passed.
                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Zhihong Yu (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5099:
------------------------------

    Fix Version/s: 0.94.0
                   0.92.0
     Hadoop Flags: Reviewed
    
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Lars Hofhansl (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177439#comment-13177439 ] 

Lars Hofhansl commented on HBASE-5099:
--------------------------------------

Sorry I did not state this clearly. I meant if (and only if) the awaitTerminition times out, we should call shutdownNow() to interrupt the worker (that is presumably still working).

Like this:
{code}
if (executor.awaitTermination(timeout, TimeUnit.MILLISECONDS)
    && result.isDone()) {
  Boolean recovered = result.get();
  if (recovered != null) {
    return recovered.booleanValue();
  }
} else {
  executor.shutdownNow();
}
return false;
{code}

                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176406#comment-13176406 ] 

Zhihong Yu commented on HBASE-5099:
-----------------------------------

bq. Another one is to abort the master
The above sounds better. How about introducing a timeout for:
{code}
      this.assignmentManager.waitForAssignment(HRegionInfo.ROOT_REGIONINFO);
{code}
after which master aborts ?
                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Zhihong Yu (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5099:
------------------------------

    Attachment: 5099.92

Add call to shutdownNow().
                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on

Posted by "Jimmy Xiang (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177886#comment-13177886 ] 

Jimmy Xiang commented on HBASE-5099:
------------------------------------

TestReplication#queueFailover has a bug that's why it is flaky:

https://issues.apache.org/jira/browse/HBASE-5112
                
> ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Jimmy Xiang (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HBASE-5099:
-------------------------------

    Attachment: hbase-5099-v2.patch
    
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on

Posted by "Zhihong Yu (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5099:
------------------------------

    Summary: ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on  (was: ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on)
    
> ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176923#comment-13176923 ] 

Hadoop QA commented on HBASE-5099:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12508799/hbase-5099-v2.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    -1 javadoc.  The javadoc tool appears to have generated -151 warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

     -1 core tests.  The patch failed these unit tests:
                       org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat
                  org.apache.hadoop.hbase.mapred.TestTableMapReduce

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/617//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/617//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/617//console

This message is automatically generated.
                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177913#comment-13177913 ] 

Hudson commented on HBASE-5099:
-------------------------------

Integrated in HBase-0.92-security #55 (See [https://builds.apache.org/job/HBase-0.92-security/55/])
    HBASE-5099  ZK event thread waiting for root region assignment may block server
               shutdown handler for the region sever the root region was on (Jimmy)

tedyu : 
Files : 
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java

                
> ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177587#comment-13177587 ] 

Hudson commented on HBASE-5099:
-------------------------------

Integrated in HBase-0.92 #215 (See [https://builds.apache.org/job/HBase-0.92/215/])
    HBASE-5099  ZK event thread waiting for root region assignment may block server
               shutdown handler for the region sever the root region was on (Jimmy)

tedyu : 
Files : 
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java

                
> ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Jimmy Xiang (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HBASE-5099:
-------------------------------

    Status: Patch Available  (was: Open)

Updated per review comments.  The failed tests: TestHFileOutputFormat,TestTableMapReduce work fine on my box.
                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on

Posted by "Jimmy Xiang (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HBASE-5099:
-------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)
    
> ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177404#comment-13177404 ] 

Zhihong Yu commented on HBASE-5099:
-----------------------------------

@Jimmy:
Do you mind producing a patch for 0.92 ?
I got the following when I tried to apply patch v5 to 0.92:
{code}
1 out of 3 hunks FAILED -- saving rejects to file src/main/java/org/apache/hadoop/hbase/master/HMaster.java.rej
{code}

Also there is no test category in 0.92:
{code}
[ERROR] /Users/zhihyu/92hbase/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java:[27,30] cannot find symbol
[ERROR] symbol  : class MediumTests
[ERROR] location: package org.apache.hadoop.hbase
{code}

This is a whitespace at line 41 in TestMasterZKSessionRecovery - I can remove it at time of integration.

Nice work.
                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Jimmy Xiang (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176356#comment-13176356 ] 

Jimmy Xiang commented on HBASE-5099:
------------------------------------

The timeout monitor thread in split log manager keeps logging this:

2011-12-27 00:00:06,759 DEBUG master.SplitLogManager (SplitLogManager.java:chore(846)) - total tasks = 12 unassigned = 12

It shows no task assigned since no task znode was created yet.

Checked zookeeper, and the there was nothing under /hbase/splitlog
                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177846#comment-13177846 ] 

Zhihong Yu commented on HBASE-5099:
-----------------------------------

I reverted 0.92 patch.

Now TestReplication passes on Mac.

Let's find out if the patch is related to replication test failure or not.

Keeping TRUNK patch in TRUNK for now since trunk build 2594 passed.
                
> ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177863#comment-13177863 ] 

Zhihong Yu commented on HBASE-5099:
-----------------------------------

Test scripts from HBASE-4480 would be useful in reproducing the test failure.
You can run TestReplication#queueFailover in a loop (on different OSes).
                
> ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177477#comment-13177477 ] 

Hadoop QA commented on HBASE-5099:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12508893/hbase-5099-v6.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 2 new or modified tests.

    -1 javadoc.  The javadoc tool appears to have generated -151 warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/634//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/634//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/634//console

This message is automatically generated.
                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Jimmy Xiang (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HBASE-5099:
-------------------------------

    Status: Patch Available  (was: Open)
    
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177375#comment-13177375 ] 

Hadoop QA commented on HBASE-5099:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12508868/hbase-5099-v3.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 2 new or modified tests.

    -1 javadoc.  The javadoc tool appears to have generated -151 warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/628//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/628//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/628//console

This message is automatically generated.
                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Jimmy Xiang (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176778#comment-13176778 ] 

Jimmy Xiang commented on HBASE-5099:
------------------------------------

Cool, let me submit a patch.
                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177487#comment-13177487 ] 

Hadoop QA commented on HBASE-5099:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12508893/hbase-5099-v6.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 2 new or modified tests.

    -1 javadoc.  The javadoc tool appears to have generated -151 warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

     -1 core tests.  The patch failed these unit tests:
                       org.apache.hadoop.hbase.mapred.TestTableMapReduce
                  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/632//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/632//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/632//console

This message is automatically generated.
                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177369#comment-13177369 ] 

jiraposter@reviews.apache.org commented on HBASE-5099:
------------------------------------------------------

bq.  On 2011-12-28 22:52:56, Ted Yu wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/master/HMaster.java, line 1463
bq.  > <https://reviews.apache.org/r/3323/diff/2/?file=65902#file65902line1463>
bq.  >
bq.  >     The call() at line 1429 would only throw 3 types of exceptions: IE, IOE and KE.
bq.  >     
bq.  >     So the cause should be one the three types.
bq.  >     I suggest using instanceof to check against each of them and if a match is found, cast as that exception type and rethrow.
bq.  >     
bq.  >     If there is no match, I suggest using this ctor to throw a new IOException:
bq.  >     IOException(String message, Throwable cause)

How about just throw ExecutionException?  It will be much simpler and the caller doesn't care about which exception it is in this case.

- Jimmy

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3323/#review4138
-----------------------------------------------------------

On 2011-12-29 18:38:54, Jimmy Xiang wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/3323/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-12-29 18:38:54)
bq.  
bq.  
bq.  Review request for hbase, Ted Yu and Michael Stack.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.
bq.  
bq.  I added a test to test the abort method.  However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted.  So the actually success recovery is not tested.
bq.  
bq.  
bq.  This addresses bug HBASE-5099.
bq.      https://issues.apache.org/jira/browse/HBASE-5099
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 
bq.    src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java PRE-CREATION 
bq.  
bq.  Diff: https://reviews.apache.org/r/3323/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  mvn -PlocalTests -Dtest=TestMaster* clean test
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Jimmy
bq.  
bq.

> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177443#comment-13177443 ] 

jiraposter@reviews.apache.org commented on HBASE-5099:
------------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3323/#review4156
-----------------------------------------------------------

Ship it!

- Lars

On 2011-12-29 22:48:36, Jimmy Xiang wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/3323/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-12-29 22:48:36)
bq.  
bq.  
bq.  Review request for hbase, Ted Yu and Michael Stack.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.
bq.  
bq.  I added a test to test the abort method.  However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted.  So the actually success recovery is not tested.
bq.  
bq.  
bq.  This addresses bug HBASE-5099.
bq.      https://issues.apache.org/jira/browse/HBASE-5099
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 
bq.    src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java PRE-CREATION 
bq.  
bq.  Diff: https://reviews.apache.org/r/3323/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  mvn -PlocalTests -Dtest=TestMaster* clean test
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Jimmy
bq.  
bq.

> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Jimmy Xiang (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HBASE-5099:
-------------------------------

    Attachment: hbase-5099-v6.patch

Updated per Lars's comment.  It will be safer.
                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177327#comment-13177327 ] 

jiraposter@reviews.apache.org commented on HBASE-5099:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3323/#review4148
-----------------------------------------------------------


The code is much closer to what was in my mind.
The new test needs improvement.


src/main/java/org/apache/hadoop/hbase/master/HMaster.java
<https://reviews.apache.org/r/3323/#comment9361>

    Line is 88 chars long.
    If ExecutorService is imported, the line should be much shorter.



src/main/java/org/apache/hadoop/hbase/master/HMaster.java
<https://reviews.apache.org/r/3323/#comment9364>

    I suggest recording the amount of time taken by the recovery and log it after executor.awaitTermination() returns.



src/main/java/org/apache/hadoop/hbase/master/HMaster.java
<https://reviews.apache.org/r/3323/#comment9362>

    Timeout value should be included in the exception message.



src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java
<https://reviews.apache.org/r/3323/#comment9365>

    Year is not needed.



src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java
<https://reviews.apache.org/r/3323/#comment9363>

    This shouldn't be small test since mini cluster is involved.



src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java
<https://reviews.apache.org/r/3323/#comment9366>

    Please add short javadoc for this new class.



src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java
<https://reviews.apache.org/r/3323/#comment9368>

    We shouldn't pass 1 here since that means 1 master.



src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java
<https://reviews.apache.org/r/3323/#comment9367>

    Since the test doesn't involve standby master, I think we should use a different name.


- Ted


On 2011-12-29 18:38:54, Jimmy Xiang wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/3323/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-12-29 18:38:54)
bq.  
bq.  
bq.  Review request for hbase, Ted Yu and Michael Stack.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.
bq.  
bq.  I added a test to test the abort method.  However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted.  So the actually success recovery is not tested.
bq.  
bq.  
bq.  This addresses bug HBASE-5099.
bq.      https://issues.apache.org/jira/browse/HBASE-5099
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 
bq.    src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java PRE-CREATION 
bq.  
bq.  Diff: https://reviews.apache.org/r/3323/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  mvn -PlocalTests -Dtest=TestMaster* clean test
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Jimmy
bq.  
bq.


                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177441#comment-13177441 ] 

jiraposter@reviews.apache.org commented on HBASE-5099:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3323/
-----------------------------------------------------------

(Updated 2011-12-29 22:48:36.725034)


Review request for hbase, Ted Yu and Michael Stack.


Summary
-------

Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.

I added a test to test the abort method.  However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted.  So the actually success recovery is not tested.


This addresses bug HBASE-5099.
    https://issues.apache.org/jira/browse/HBASE-5099


Diffs (updated)
-----

  src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 
  src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java PRE-CREATION 

Diff: https://reviews.apache.org/r/3323/diff


Testing
-------

mvn -PlocalTests -Dtest=TestMaster* clean test


Thanks,

Jimmy


                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176783#comment-13176783 ] 

jiraposter@reviews.apache.org commented on HBASE-5099:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3323/
-----------------------------------------------------------

Review request for hbase, Ted Yu and Michael Stack.


Summary
-------

Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.

I added a test to test the abort method.  However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted.  So the actually success recovery is not tested.


This addresses bug HBASE-5099.
    https://issues.apache.org/jira/browse/HBASE-5099


Diffs
-----

  src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 
  src/test/java/org/apache/hadoop/hbase/master/TestMaster.java b7a8270 

Diff: https://reviews.apache.org/r/3323/diff


Testing
-------

mvn -PlocalTests -Dtest=TestMaster* clean test


Thanks,

Jimmy


                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Jimmy Xiang (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177446#comment-13177446 ] 

Jimmy Xiang commented on HBASE-5099:
------------------------------------

There is no harm to call shutdownNow() even if awaitTerminition is not timed out.

So this should be fine:

if (executor.awaitTermination(timeout, TimeUnit.MILLISECONDS)
    && result.isDone()) {
  Boolean recovered = result.get();
  if (recovered != null) {
    return recovered.booleanValue();
  }
}
executor.shutdownNow();

                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177431#comment-13177431 ] 

jiraposter@reviews.apache.org commented on HBASE-5099:
------------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3323/#review4154
-----------------------------------------------------------

src/main/java/org/apache/hadoop/hbase/master/HMaster.java
<https://reviews.apache.org/r/3323/#comment9383>

    Should we issue executor.shutdownNow() here if the worker did not finish, in order to attempt to interrupt and long running worker?

- Lars

On 2011-12-29 21:16:22, Jimmy Xiang wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/3323/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-12-29 21:16:22)
bq.  
bq.  
bq.  Review request for hbase, Ted Yu and Michael Stack.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.
bq.  
bq.  I added a test to test the abort method.  However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted.  So the actually success recovery is not tested.
bq.  
bq.  
bq.  This addresses bug HBASE-5099.
bq.      https://issues.apache.org/jira/browse/HBASE-5099
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 
bq.    src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java PRE-CREATION 
bq.  
bq.  Diff: https://reviews.apache.org/r/3323/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  mvn -PlocalTests -Dtest=TestMaster* clean test
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Jimmy
bq.  
bq.

> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177397#comment-13177397 ] 

jiraposter@reviews.apache.org commented on HBASE-5099:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3323/
-----------------------------------------------------------

(Updated 2011-12-29 21:16:22.655431)


Review request for hbase, Ted Yu and Michael Stack.


Changes
-------

Minor changes per review comments.


Summary
-------

Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.

I added a test to test the abort method.  However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted.  So the actually success recovery is not tested.


This addresses bug HBASE-5099.
    https://issues.apache.org/jira/browse/HBASE-5099


Diffs (updated)
-----

  src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 
  src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java PRE-CREATION 

Diff: https://reviews.apache.org/r/3323/diff


Testing
-------

mvn -PlocalTests -Dtest=TestMaster* clean test


Thanks,

Jimmy


                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176867#comment-13176867 ] 

jiraposter@reviews.apache.org commented on HBASE-5099:
------------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3323/#review4138
-----------------------------------------------------------

src/main/java/org/apache/hadoop/hbase/master/HMaster.java
<https://reviews.apache.org/r/3323/#comment9344>

    I think declaring ExecutionException is better than declaring Throwable.
    See comment below.

src/main/java/org/apache/hadoop/hbase/master/HMaster.java
<https://reviews.apache.org/r/3323/#comment9345>

    The call() at line 1429 would only throw 3 types of exceptions: IE, IOE and KE.

    So the cause should be one the three types.
    I suggest using instanceof to check against each of them and if a match is found, cast as that exception type and rethrow.

    If there is no match, I suggest using this ctor to throw a new IOException:
    IOException(String message, Throwable cause) 

- Ted

On 2011-12-28 22:38:05, Jimmy Xiang wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/3323/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-12-28 22:38:05)
bq.  
bq.  
bq.  Review request for hbase, Ted Yu and Michael Stack.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.
bq.  
bq.  I added a test to test the abort method.  However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted.  So the actually success recovery is not tested.
bq.  
bq.  
bq.  This addresses bug HBASE-5099.
bq.      https://issues.apache.org/jira/browse/HBASE-5099
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 
bq.    src/test/java/org/apache/hadoop/hbase/master/TestMaster.java b7a8270 
bq.  
bq.  Diff: https://reviews.apache.org/r/3323/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  mvn -PlocalTests -Dtest=TestMaster* clean test
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Jimmy
bq.  
bq.

> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Jimmy Xiang (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HBASE-5099:
-------------------------------

    Attachment: hbase-5099-v5.patch
    
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on

Posted by "Jimmy Xiang (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177847#comment-13177847 ] 

Jimmy Xiang commented on HBASE-5099:
------------------------------------

TestReplication is flaky.  But it works on my ubuntu box.
Let me take a look.
                
> ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177421#comment-13177421 ] 

Zhihong Yu commented on HBASE-5099:
-----------------------------------

@Jimmy:
Can you put the patch on a real cluster and see if the 300 second timeout is good for zk session recovery ?

Thanks
                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177410#comment-13177410 ] 

Zhihong Yu commented on HBASE-5099:
-----------------------------------

Will integrate tomorrow morning if I don't get objections.
                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177850#comment-13177850 ] 

Zhihong Yu commented on HBASE-5099:
-----------------------------------

Please read through the test output of 0.92 builds 217 and 218.
With patch 5099.92, the test failure is reproducible on MacBook.

Another validation is to deploy patch 5099.92 to real clusters and see if replication works.
                
> ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177890#comment-13177890 ] 

Hudson commented on HBASE-5099:
-------------------------------

Integrated in HBase-0.92-security #54 (See [https://builds.apache.org/job/HBase-0.92-security/54/])
    HBASE-5099 revert due to continuous 0.92 build failures
HBASE-5099  ZK event thread waiting for root region assignment may block server
               shutdown handler for the region sever the root region was on (Jimmy)

tedyu : 
Files : 
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java

tedyu : 
Files : 
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java

                
> ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Jimmy Xiang (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177417#comment-13177417 ] 

Jimmy Xiang commented on HBASE-5099:
------------------------------------

@Ted, thanks!
                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on

Posted by "Jimmy Xiang (Resolved) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang resolved HBASE-5099.
--------------------------------

    Resolution: Fixed
    
> ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Jimmy Xiang (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HBASE-5099:
-------------------------------

    Attachment: hbase-5099.patch
    
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177394#comment-13177394 ] 

jiraposter@reviews.apache.org commented on HBASE-5099:
------------------------------------------------------

bq.  On 2011-12-29 20:55:31, Ted Yu wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/master/HMaster.java, line 1463
bq.  > <https://reviews.apache.org/r/3323/diff/4/?file=65969#file65969line1463>
bq.  >
bq.  >     The catch block is redundant.

That's right.  I forgot it :)

bq.  On 2011-12-29 20:55:31, Ted Yu wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/master/HMaster.java, line 1453
bq.  > <https://reviews.apache.org/r/3323/diff/4/?file=65969#file65969line1453>
bq.  >
bq.  >     Line should wrap after = sign.

too long?  sure.

bq.  On 2011-12-29 20:55:31, Ted Yu wrote:
bq.  > src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java, line 41
bq.  > <https://reviews.apache.org/r/3323/diff/4/?file=65970#file65970line41>
bq.  >
bq.  >     How about giving "hbase.master.zksession.recover.timeout" a much small value for the mini cluster so that the timeout value on line 73 can be lowered ?

sure

- Jimmy

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3323/#review4152
-----------------------------------------------------------

On 2011-12-29 20:41:26, Jimmy Xiang wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/3323/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-12-29 20:41:26)
bq.  
bq.  
bq.  Review request for hbase, Ted Yu and Michael Stack.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.
bq.  
bq.  I added a test to test the abort method.  However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted.  So the actually success recovery is not tested.
bq.  
bq.  
bq.  This addresses bug HBASE-5099.
bq.      https://issues.apache.org/jira/browse/HBASE-5099
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 
bq.    src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java PRE-CREATION 
bq.  
bq.  Diff: https://reviews.apache.org/r/3323/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  mvn -PlocalTests -Dtest=TestMaster* clean test
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Jimmy
bq.  
bq.

> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177433#comment-13177433 ] 

Zhihong Yu commented on HBASE-5099:
-----------------------------------

I think executor.shutdown() is appropriate here since we want the original body of tryRecoveringExpiredZKSession() to complete.
There is new timeout governing the wait for shutdown to complete.
                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177489#comment-13177489 ] 

Hadoop QA commented on HBASE-5099:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12508896/5099.92
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    -1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/636//console

This message is automatically generated.
                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177900#comment-13177900 ] 

Zhihong Yu commented on HBASE-5099:
-----------------------------------

Integrated 5099.92 to 0.92 branch again.
                
> ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Jimmy Xiang (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HBASE-5099:
-------------------------------

    Status: Patch Available  (was: Open)
    
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Jimmy Xiang (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176447#comment-13176447 ] 

Jimmy Xiang commented on HBASE-5099:
------------------------------------

tryRecoveringExpiredZKSession() is only called by abortNow(), which is called by abort(), which is called by the
eventThread.  I was thinking to put this whole method in another thread with executor service and time it out after a certain time, for example, 5 minutes, then fails the recovery and let it abort.

This way, we don't have to adding timeout for all the methods.  The regular master startup is not impacted which calls assignRootAndMeta() too.

However, if we know most likely just waitForAssignment() takes a long time, we can add timeout to this method only. But I am not so sure.
                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Reopened] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on

Posted by "Zhihong Yu (Reopened) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu reopened HBASE-5099:
-------------------------------


0.92 Jenkins builds have failed 4 times in a roll.

TestReplication#queueFailover failed in builds 217 and 218.
It failed consistently on MacBook as well.

Rolling back the patches.
                
> ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on

Posted by "Zhihong Yu (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5099:
------------------------------

    Comment: was deleted

(was: -1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12508896/5099.92
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    -1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/636//console

This message is automatically generated.)
    
> ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Jimmy Xiang (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176383#comment-13176383 ] 

Jimmy Xiang commented on HBASE-5099:
------------------------------------

After we restarted the master, distributed log splitting can go through and the system is back to normal.

This can happen if zk session is expired, during tryRecoveringExpiredZKSession(), if it cannot connect to the region server 
where root or meta is on, this region will be marked dead.  Then distributed log splitting is kicked in.  But it won't go
through since the eventThread is waiting for the root/meta region assigned.

So this problem will happen whenever master tryRecoveringExpiredZKSession and the region server with root/meta is not available.

One fix is to using sync call in accessing zk in distributed log splitting.  Another one is to abort the master anyway
(don't try to recover since it may not be recoverable) if the region server with root/meta region is not reachable.
                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176454#comment-13176454 ] 

Zhihong Yu commented on HBASE-5099:
-----------------------------------

Timing out tryRecoveringExpiredZKSession() is good.
We can introduce a separate thread to carry the current logic of tryRecoveringExpiredZKSession() and monitor this thread in eventThread.
                
> ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region sever the root region is on
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5099
>                 URL: https://issues.apache.org/jira/browse/HBASE-5099
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0, 0.94.0
>            Reporter: Jimmy Xiang
>         Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png
>
>
> A RS died.  The ServerShutdownHandler kicked in and started the logspliting.  SpliLogManager
> installed the tasks asynchronously, then started to wait for them to complete.
> The task znodes were not created actually.  The requests were just queued.
> At this time, the zookeeper connection expired.  HMaster tried to recover the expired ZK session.
> During the recovery, a new zookeeper connection was created.  However, this master became the
> new master again.  It tried to assign root and meta.
> Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
> This waiting holds the zookeeper event thread.  So the async create split task is never retried since
> there is only one event thread, which is waiting for the root region assigned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira