You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@hbase.apache.org by "nkeywal (Created) (JIRA)" <ji...@apache.org> on 2012/04/20 12:38:37 UTC

[jira] [Created] (HBASE-5844) Delete the region servers znode after a regions server crash

Delete the region servers znode after a regions server crash
------------------------------------------------------------

                 Key: HBASE-5844
                 URL: https://issues.apache.org/jira/browse/HBASE-5844
             Project: HBase
          Issue Type: Improvement
          Components: regionserver, scripts
    Affects Versions: 0.96.0
            Reporter: nkeywal
            Assignee: nkeywal


today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.

By deleting the znode in start script, we remove this delay and the recovery starts immediately.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "nkeywal (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13268224#comment-13268224 ] 

nkeywal commented on HBASE-5844:
--------------------------------

Ready to be committed imho.
                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>             Fix For: 0.96.0
>
>         Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 5844.v3.patch, 5844.v4.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "nkeywal (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nkeywal updated HBASE-5844:
---------------------------

    Attachment: 5844.v3.patch
    
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>             Fix For: 0.96.0
>
>         Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 5844.v3.patch, 5844.v4.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-5844:
-------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Committed to trunk.  Thanks for the patch N.
                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>             Fix For: 0.96.0
>
>         Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 5844.v3.patch, 5844.v4.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "nkeywal (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13464093#comment-13464093 ] 

nkeywal commented on HBASE-5844:
--------------------------------

It's strange, I didn't reproduce it. I should, because it seems logical. Will look into it and create jiras.
Anyway, there are bugs around this scenario. For example, when it fails we now have a new pid file, but this pid does not match the process. This is true in 0.90 as well. If there is no process, the error for the stop (in 0.96) will be ??no regionserver to stop because kill -0 of pid 49938 failed with status 1??. If another process took this id (yes it should not happen often), the kill will succeed.
                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>             Fix For: 0.96.0
>
>         Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 5844.v3.patch, 5844.v4.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "nkeywal (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463591#comment-13463591 ] 

nkeywal commented on HBASE-5844:
--------------------------------

Humm.
The feature is important imho: waiting 30s (at best) before starting a recovery is really nice.
In a ideal world, ZooKeeper would make this less useful by detecting the dead process sooner, but still it can't be faster than this. 

Note that znode remover should occur when the process finishes, not before starting a new one. What JD describes seems a bug to me.






                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>             Fix For: 0.96.0
>
>         Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 5844.v3.patch, 5844.v4.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "nkeywal (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13263778#comment-13263778 ] 

nkeywal commented on HBASE-5844:
--------------------------------

I found the issue, and (hopefully) a fix. I will have a new patch middle of next week, I will include the master znode in this one...
                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>             Fix For: 0.96.0
>
>         Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "Zhihong Yu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13259914#comment-13259914 ] 

Zhihong Yu commented on HBASE-5844:
-----------------------------------

{code}
+      "(Environment variable HBASE_ZNODE_FILE is no set).");
{code}
'is no set' -> 'is not set'
                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>         Attachments: 5844.v1.patch, 5844.v2.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "nkeywal (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258788#comment-13258788 ] 

nkeywal commented on HBASE-5844:
--------------------------------

I didn't know this parameter. It's interesting, because with ZK the default timeout is 30 seconds, but with HBase it's now 180s (from hbase-default.xml). It was increased to 60s a first time in HBASE-1772. It seems it was increased because of the GC.

But it means that deleting immediately the ZK represents a huge mttr improvement for the regions server crash case.



                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>         Attachments: 5844.v1.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "nkeywal (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463899#comment-13463899 ] 

nkeywal commented on HBASE-5844:
--------------------------------

btw: I'm having a look at this to understand what's happening.
                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>             Fix For: 0.96.0
>
>         Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 5844.v3.patch, 5844.v4.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-5844.
--------------------------

       Resolution: Fixed
    Fix Version/s: 0.96.0
     Hadoop Flags: Reviewed

Committed to trunk.  Thanks for the patch N.
                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>             Fix For: 0.96.0
>
>         Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "nkeywal (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nkeywal updated HBASE-5844:
---------------------------

    Status: Patch Available  (was: Reopened)
    
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>             Fix For: 0.96.0
>
>         Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 5844.v3.patch, 5844.v4.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "nkeywal (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258137#comment-13258137 ] 

nkeywal commented on HBASE-5844:
--------------------------------

Patch v1. Tested on a local cluster & pseudo distributed, by stopping the server by kill -9. I will do some minor improvements on the java code and then test on a real cluster, but I'm interested by a feedback on the script.

                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>         Attachments: 5844.v1.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506832#comment-13506832 ] 

Jean-Daniel Cryans commented on HBASE-5844:
-------------------------------------------

Encountered another problem that I think I can link to this jira, I was trying to run HBase from trunk without internet access and like in my Sept 25th comment, I get an empty line after start-hbase.sh but now nothing is running. The .log file doesn't show anything after logging ulimit and nothing's in the .out file. After running some bash -x, I was able to figure out that the nohup output was being suppressed. See:

{noformat}
jdcryans-MBPr:hbase-github jdcryans$ ./bin/start-hbase.sh 
jdcryans-MBPr:hbase-github jdcryans$
jdcryans-MBPr:hbase-github jdcryans$ bash -x ./bin/start-hbase.sh 
... some stuff then
+ /Users/jdcryans/git/hbase-github/bin/hbase-daemon.sh start master
jdcryans-MBPr:hbase-github jdcryans$ bash -x /Users/jdcryans/git/hbase-github/bin/hbase-daemon.sh start master
... more stuff
+ nohup /Users/jdcryans/git/hbase-github/bin/hbase-daemon.sh --config /Users/jdcryans/git/hbase-github/bin/../conf internal_start master
jdcryans-MBPr:hbase-github jdcryans$ nohup /Users/jdcryans/git/hbase-github/bin/hbase-daemon.sh --config /Users/jdcryans/git/hbase-github/bin/../conf internal_start master
appending output to nohup.out
{noformat}

So now I see that it's writing to nohup.out, which in turn tells me what really happened:

{noformat}
Caused by: java.lang.ClassNotFoundException: org.apache.zookeeper.KeeperException
	at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
{noformat}

Reproing can be done by physically deleting any jar listed in target/cached_classpath.txt. In my case I think the jar wasn't available because I had no internet connection.

I wonder what other errors it could hide like this.
                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>             Fix For: 0.96.0
>
>         Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 5844.v3.patch, 5844.v4.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "nkeywal (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nkeywal updated HBASE-5844:
---------------------------

    Attachment: 5844.v4.patch
    
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>             Fix For: 0.96.0
>
>         Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 5844.v3.patch, 5844.v4.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13263369#comment-13263369 ] 

Hudson commented on HBASE-5844:
-------------------------------

Integrated in HBase-TRUNK-security #186 (See [https://builds.apache.org/job/HBase-TRUNK-security/186/])
    HBASE-5844 Delete the region servers znode after a regions server crash; REVERT (Revision 1330983)

     Result = SUCCESS
stack : 
Files : 
* /hbase/trunk/bin/hbase-daemon.sh
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java

                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>             Fix For: 0.96.0
>
>         Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13259921#comment-13259921 ] 

stack commented on HBASE-5844:
------------------------------

@N Ok. I'll commit this then in a new JIRA move out the common code to util or some place.  I'll handle Ted's comment on commit.
                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>         Attachments: 5844.v1.patch, 5844.v2.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258796#comment-13258796 ] 

stack commented on HBASE-5844:
------------------------------

If we go to a count > 100 we just continue the startup?  Is that what you want?

{code}
+    while (!tracker.checkIfBaseNodeAvailable() && ++count<100) {
+      Thread.sleep(100);
+    }
{code}

Be like the rest of the code regards spaces; i.e. spaces around operators...


+
+    if (fileName==null){


Maybe you don't need deleteMyEphemeralNodeOnDisk if you instead use http://docs.oracle.com/javase/6/docs/api/java/io/File.html#deleteOnExit() inside in writeMyEphemeralNodeOnDisk?

Patch looks good N.

We upped the timeout because noobs would install hbase then run big mapreduce jobs w/o turning jvm and so big GCs.  We figured they'd rather have their regionserver ride over the big pauses than have them be 'sensitive' out of the box.
                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>         Attachments: 5844.v1.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13267352#comment-13267352 ] 

Hadoop QA commented on HBASE-5844:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12525420/5844.v4.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    +1 hadoop23.  The patch compiles against the hadoop 0.23.x profile.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1743//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1743//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1743//console

This message is automatically generated.
                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>             Fix For: 0.96.0
>
>         Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 5844.v3.patch, 5844.v4.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "nkeywal (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13259904#comment-13259904 ] 

nkeywal commented on HBASE-5844:
--------------------------------

You're right. I propose to commit this patch, I will then generalize the solution to master in another jira.
                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>         Attachments: 5844.v1.patch, 5844.v2.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "nkeywal (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nkeywal updated HBASE-5844:
---------------------------

    Attachment: 5844.v2.patch
    
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>         Attachments: 5844.v1.patch, 5844.v2.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "nkeywal (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258803#comment-13258803 ] 

nkeywal commented on HBASE-5844:
--------------------------------

For the tracker, it's my private workaround for HBASE-5666, it should not have been included in this patch. Sorry about this.

I think it's better to delete the file explicitly, just after the znode deletion. HRegionServer#deleteMyEphemeralNode is called only once, and I added deleteMyEphemeralNodeOnDisk just after this call. If we rely on #deleteOnExit, I fear we could have the file deleted with a still alive znode. I'm not sure and I have not tried it, but I think it's too easy to enter into the jvm-specific-behavior space here.

I will fix the java code and try the whole fix on a real cluster for the v2.

Thanks you for the review.




                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>         Attachments: 5844.v1.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260046#comment-13260046 ] 

Hudson commented on HBASE-5844:
-------------------------------

Integrated in HBase-TRUNK #2801 (See [https://builds.apache.org/job/HBase-TRUNK/2801/])
    HBASE-5844 Delete the region servers znode after a regions server crash (Revision 1329430)

     Result = FAILURE
stack : 
Files : 
* /hbase/trunk/bin/hbase-daemon.sh
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java

                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>             Fix For: 0.96.0
>
>         Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463401#comment-13463401 ] 

Jean-Daniel Cryans commented on HBASE-5844:
-------------------------------------------

One thing that worries about this patch is the situation where the pid file is gone and someone tries to start the region server. It happened to me a bunch of times. I tried it with you patch and since it removes ephemeral znode it _kills_ the region server that's already running and doesn't start a new one because the ports are already occupied.

I'm not sure if this is related to this patch, but we're now missing info when using the scripts. We used to have:

{noformat}
su-jdcryans-2:0.94 jdcryans$ ./bin/start-hbase.sh 
localhost: starting zookeeper, logging to /Users/jdcryans/Work/HBase/0.94/bin/../logs/hbase-jdcryans-zookeeper-h-25-185.sfo.stumble.net.out
starting master, logging to /Users/jdcryans/Work/HBase/0.94/bin/../logs/hbase-jdcryans-master-h-25-185.sfo.stumble.net.out
localhost: starting regionserver, logging to /Users/jdcryans/Work/HBase/0.94/bin/../logs/hbase-jdcryans-regionserver-h-25-185.sfo.stumble.net.out
{noformat}

Now we have:

{noformat}
su-jdcryans-2:trunk-commit jdcryans$ ./bin/start-hbase.sh 

su-jdcryans-2:trunk-commit jdcryans$ 
{noformat}
                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>             Fix For: 0.96.0
>
>         Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 5844.v3.patch, 5844.v4.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Reopened] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "nkeywal (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nkeywal reopened HBASE-5844:
----------------------------


There is a regression when the cluster is fully distributed: the start command hangs. I'm on it. In the meantime, would it be possible to undo the commit?

Sorry about this.
                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>             Fix For: 0.96.0
>
>         Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258988#comment-13258988 ] 

stack commented on HBASE-5844:
------------------------------

Ok on your reasoning for not using deleteOnExit.  Try and have the two methods share more code like getting the name of the file w/ the znode name in it.  Otherwise, sounds good.
                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>         Attachments: 5844.v1.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-5844:
-------------------------

    Attachment: 5844.v3.patch

What I committed.  Its v2 + addressing Ted comment.
                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>             Fix For: 0.96.0
>
>         Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262773#comment-13262773 ] 

stack commented on HBASE-5844:
------------------------------

I reverted the patch from trunk.
                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>             Fix For: 0.96.0
>
>         Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13268530#comment-13268530 ] 

Hudson commented on HBASE-5844:
-------------------------------

Integrated in HBase-TRUNK #2848 (See [https://builds.apache.org/job/HBase-TRUNK/2848/])
    HBASE-5844 Delete the region servers znode after a regions server crash (Revision 1334028)

     Result = FAILURE
stack : 
Files : 
* /hbase/trunk/bin/hbase-daemon.sh
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java

                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>             Fix For: 0.96.0
>
>         Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 5844.v3.patch, 5844.v4.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "Jean-Daniel Cryans (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258422#comment-13258422 ] 

Jean-Daniel Cryans commented on HBASE-5844:
-------------------------------------------

It is deleted automatically, it's an ephemeral znode so it takes zk.session.timeout time to see it disappear.
                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>         Attachments: 5844.v1.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13259811#comment-13259811 ] 

stack commented on HBASE-5844:
------------------------------

Should master write out its znode name too?  If it crashes this code could bring on the second master faster?
                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>         Attachments: 5844.v1.patch, 5844.v2.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "nkeywal (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nkeywal updated HBASE-5844:
---------------------------

    Attachment: 5844.v1.patch
    
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>         Attachments: 5844.v1.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463409#comment-13463409 ] 

stack commented on HBASE-5844:
------------------------------

Looking at this w/ j-d, now we no longer do nohup so the parent process can stick around to watch out for the server crash. This make it so now there are two  hbase processes listed per launched daemon.  This is kinda ugly.

When we have this bash script watching the running java process we verge into the territory normally occupied by babysitters like supervise.   Our parent bash script will always be less than a real babysitter -- supervise, god, etc. -- so maybe we should just have this kill znode as an optional script w/ prescription for how to set it up -- e.g. run znode remover on daemon crash before starting new one (if we want supervise to start a new one).

I'm thinking we should back this out since there are open questions still.
                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>             Fix For: 0.96.0
>
>         Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 5844.v3.patch, 5844.v4.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "nkeywal (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nkeywal updated HBASE-5844:
---------------------------

    Status: Patch Available  (was: Open)
    
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>             Fix For: 0.96.0
>
>         Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 5844.v3.patch, 5844.v4.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463951#comment-13463951 ] 

Jean-Daniel Cryans commented on HBASE-5844:
-------------------------------------------

To repro: start a local distributed cluster (without hadoop), remove the RS's pid file in /tmp, try starting a region server again. My guess is that since the port is occupied then the second won't start but the znode deletion still runs after.
                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>             Fix For: 0.96.0
>
>         Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 5844.v3.patch, 5844.v4.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13268906#comment-13268906 ] 

Hudson commented on HBASE-5844:
-------------------------------

Integrated in HBase-TRUNK-security #192 (See [https://builds.apache.org/job/HBase-TRUNK-security/192/])
    HBASE-5844 Delete the region servers znode after a regions server crash (Revision 1334028)

     Result = SUCCESS
stack : 
Files : 
* /hbase/trunk/bin/hbase-daemon.sh
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java

                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>             Fix For: 0.96.0
>
>         Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 5844.v3.patch, 5844.v4.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "nkeywal (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13259692#comment-13259692 ] 

nkeywal commented on HBASE-5844:
--------------------------------

v2 should be ok. It does not include anymore the fix for HBASE-5666, so it cannot be tested locally but I tried it before removing the workaround. 
                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>         Attachments: 5844.v1.patch, 5844.v2.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "nkeywal (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nkeywal updated HBASE-5844:
---------------------------

    Status: Open  (was: Patch Available)
    
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>             Fix For: 0.96.0
>
>         Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 5844.v3.patch, 5844.v4.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "nkeywal (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13267332#comment-13267332 ] 

nkeywal commented on HBASE-5844:
--------------------------------

v4 should be ok.
I will do another jira for the master.
                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>             Fix For: 0.96.0
>
>         Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 5844.v3.patch, 5844.v4.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260258#comment-13260258 ] 

Hudson commented on HBASE-5844:
-------------------------------

Integrated in HBase-TRUNK-security #182 (See [https://builds.apache.org/job/HBase-TRUNK-security/182/])
    HBASE-5844 Delete the region servers znode after a regions server crash (Revision 1329430)

     Result = SUCCESS
stack : 
Files : 
* /hbase/trunk/bin/hbase-daemon.sh
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java

                
> Delete the region servers znode after a regions server crash
> ------------------------------------------------------------
>
>                 Key: HBASE-5844
>                 URL: https://issues.apache.org/jira/browse/HBASE-5844
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, scripts
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>             Fix For: 0.96.0
>
>         Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch
>
>
> today, if the regions server crashes, its znode is not deleted in ZooKeeper. So the recovery process will stop only after a timeout, usually 30s.
> By deleting the znode in start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira