
Posted to issues@hbase.apache.org by "ramkrishna.s.vasudevan (Created) (JIRA)" <ji...@apache.org> on 2012/01/06 17:54:39 UTC

[jira] [Created] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException
------------------------------------------------------------------------------------

                 Key: HBASE-5137
                 URL: https://issues.apache.org/jira/browse/HBASE-5137
             Project: HBase
          Issue Type: Bug
    Affects Versions: 0.90.4
            Reporter: ramkrishna.s.vasudevan
            Assignee: ramkrishna.s.vasudevan


I am not sure whether this bug has already been raised in JIRA.
In our test cluster, a region server (RS) went down and ServerShutDownHandler started splitLog.
Because HDFS was down, the waitOnSafeMode check threw an IOException.
{code}
try {
        // If FS is in safe mode, just wait till out of it.
        FSUtils.waitOnSafeMode(conf,
          conf.getInt(HConstants.THREAD_WAKE_FREQUENCY, 1000));  
        splitter.splitLog();
      } catch (OrphanHLogAfterSplitException e) {
{code}
The exception is caught here, and the result of checkFileSystem() is ignored:
{code}
} catch (IOException e) {
      checkFileSystem();
      LOG.error("Failed splitting " + logDir.toString(), e);
    }
{code}
So the HLog split never happened. As a result, we lost about 4 regions that had recently been split on the crashed RS.

Can we abort the Master in such scenarios? Please suggest.
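
The behavior being asked for can be sketched as follows. This is only an illustration under stated assumptions, not the attached patch: LogSplitter, FileSystemProbe and Aborter are hypothetical stand-ins for the safe-mode wait plus split, for checkFileSystem(), and for whatever abort mechanism the Master would use.

```java
import java.io.IOException;

class SplitLogAbortSketch {
  /** Hypothetical stand-in for waitOnSafeMode() followed by the log split. */
  public interface LogSplitter { void waitOnSafeModeAndSplit() throws IOException; }

  /** Hypothetical stand-in for checkFileSystem(): true when the file system is usable. */
  public interface FileSystemProbe { boolean isAvailable(); }

  /** Hypothetical abort hook; a real master would stop itself here. */
  public interface Aborter { void abort(String why, Throwable cause); }

  /** Returns true when the split completed, false when it failed. */
  public static boolean splitLog(LogSplitter splitter, FileSystemProbe fs, Aborter aborter) {
    try {
      splitter.waitOnSafeModeAndSplit();
      return true;
    } catch (IOException e) {
      // Act on the result of the file system check instead of discarding it.
      if (!fs.isAvailable()) {
        aborter.abort("File system unavailable, log split did not run", e);
      }
      return false;
    }
  }

  public static void main(String[] args) {
    boolean[] aborted = {false};
    boolean ok = splitLog(
        () -> { throw new IOException("HDFS is down"); }, // safe-mode wait fails
        () -> false,                                      // file system check fails
        (why, cause) -> aborted[0] = true);               // record the abort request
    System.out.println("split ok=" + ok + ", aborted=" + aborted[0]);
  }
}
```

The point of the sketch is only that the IOException path consults the file system check and escalates, rather than logging and carrying on as if the split had run.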

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

Posted by "ramkrishna.s.vasudevan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182060#comment-13182060 ] 

ramkrishna.s.vasudevan commented on HBASE-5137:
-----------------------------------------------

@Ted
In trunk we sleep for a configured interval, which is why we handle InterruptedException. But I think that is not needed either: in trunk, once we know the file system is unavailable we call Runtime.halt(). If the file system is available, why do we need to sleep for some time and then retry?
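
For readers following the discussion, the sleep-and-retry pattern in question looks roughly like the sketch below. It is an assumption-laden illustration, not the trunk code: Splitter, FileSystemProbe, Outcome and splitWithRetry are hypothetical names, and the sketch returns a status where the real code would halt the process.

```java
import java.io.IOException;

class SplitRetrySketch {
  /** Hypothetical stand-in for one log split attempt. */
  public interface Splitter { void split() throws IOException; }

  /** Hypothetical stand-in for checkFileSystem(). */
  public interface FileSystemProbe { boolean isAvailable(); }

  public enum Outcome { SPLIT_DONE, FILESYSTEM_DOWN, INTERRUPTED }

  public static Outcome splitWithRetry(Splitter splitter, FileSystemProbe fs,
                                       long retryIntervalMs) {
    while (true) {
      try {
        splitter.split();
        return Outcome.SPLIT_DONE;
      } catch (IOException e) {
        if (!fs.isAvailable()) {
          // The real code would halt the process at this point.
          return Outcome.FILESYSTEM_DOWN;
        }
        try {
          // File system is fine, so wait before trying the split again.
          Thread.sleep(retryIntervalMs);
        } catch (InterruptedException ie) {
          // Restore the interrupt flag and give up retrying.
          Thread.currentThread().interrupt();
          return Outcome.INTERRUPTED;
        }
      }
    }
  }
}
```

The InterruptedException handling exists only because of the sleep; without the retry interval the catch block would reduce to the file system check.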

                
> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-5137
>                 URL: https://issues.apache.org/jira/browse/HBASE-5137
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.4
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>         Attachments: HBASE-5137.patch

[jira] [Commented] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183413#comment-13183413 ] 

stack commented on HBASE-5137:
------------------------------

@Ram Please commit to 0.92 branch also.
                
> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-5137
>                 URL: https://issues.apache.org/jira/browse/HBASE-5137
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.4
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.1, 0.90.6
>
>         Attachments: 5137-trunk.txt, HBASE-5137.patch, HBASE-5137.patch

[jira] [Commented] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

Posted by "ramkrishna.s.vasudevan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181963#comment-13181963 ] 

ramkrishna.s.vasudevan commented on HBASE-5137:
-----------------------------------------------

Patch for 0.90.
                
> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-5137
>                 URL: https://issues.apache.org/jira/browse/HBASE-5137
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.4
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>         Attachments: HBASE-5137.patch

[jira] [Commented] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

Posted by "ramkrishna.s.vasudevan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182646#comment-13182646 ] 

ramkrishna.s.vasudevan commented on HBASE-5137:
-----------------------------------------------

Can we commit this issue?
                
> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-5137
>                 URL: https://issues.apache.org/jira/browse/HBASE-5137
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.4
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>         Attachments: 5137-trunk.txt, HBASE-5137.patch, HBASE-5137.patch

[jira] [Updated] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

Posted by "Zhihong Yu (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5137:
------------------------------

    Attachment: 5137-trunk.txt

Suggested patch for TRUNK.
                
> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-5137
>                 URL: https://issues.apache.org/jira/browse/HBASE-5137
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.4
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>         Attachments: 5137-trunk.txt, HBASE-5137.patch

[jira] [Commented] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182695#comment-13182695 ] 

Zhihong Yu commented on HBASE-5137:
-----------------------------------

Yes.
Please don't forget the patch for TRUNK :-)
                
> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-5137
>                 URL: https://issues.apache.org/jira/browse/HBASE-5137
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.4
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>         Attachments: 5137-trunk.txt, HBASE-5137.patch, HBASE-5137.patch

[jira] [Commented] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182111#comment-13182111 ] 

Zhihong Yu commented on HBASE-5137:
-----------------------------------

Nicolas might know the reason for introducing the hbase.hlog.split.failure.retry.interval parameter.

Please provide a patch for 0.92 and TRUNK that adds a check for retrySplitting to the following if statement (line 220):
{code}
        if (!checkFileSystem()) {
{code}
                
> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-5137
>                 URL: https://issues.apache.org/jira/browse/HBASE-5137
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.4
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>         Attachments: HBASE-5137.patch

[jira] [Commented] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

Posted by "ramkrishna.s.vasudevan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183424#comment-13183424 ] 

ramkrishna.s.vasudevan commented on HBASE-5137:
-----------------------------------------------

Committed to 0.92 also.
                
> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-5137
>                 URL: https://issues.apache.org/jira/browse/HBASE-5137
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.4
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.0, 0.90.6
>
>         Attachments: 5137-trunk.txt, HBASE-5137.patch, HBASE-5137.patch

[jira] [Commented] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183934#comment-13183934 ] 

Hudson commented on HBASE-5137:
-------------------------------

Integrated in HBase-TRUNK-security #72 (See [https://builds.apache.org/job/HBase-TRUNK-security/72/])
    HBASE-5137 MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException(Ram & Ted)

ramkrishna : 
Files : 
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java

                
> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-5137
>                 URL: https://issues.apache.org/jira/browse/HBASE-5137
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.4
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.0, 0.90.6
>
>         Attachments: 5137-trunk.txt, HBASE-5137.patch, HBASE-5137.patch

[jira] [Commented] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182209#comment-13182209 ] 

Zhihong Yu commented on HBASE-5137:
-----------------------------------

Second patch is fine. 

What do you think of the patch for trunk?
                
> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-5137
>                 URL: https://issues.apache.org/jira/browse/HBASE-5137
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.4
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>         Attachments: 5137-trunk.txt, HBASE-5137.patch, HBASE-5137.patch

[jira] [Commented] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181442#comment-13181442 ] 

Zhihong Yu commented on HBASE-5137:
-----------------------------------

Thanks for reporting this, Ram.
This has been fixed in TRUNK:
{code}
      } catch (IOException ioe) {
        LOG.warn("Failed splitting of " + serverNames, ioe);
        if (!checkFileSystem()) {
          LOG.warn("Bad Filesystem, exiting");
          Runtime.getRuntime().halt(1);
        }
{code}
So answer to your question is: yes.
                
> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-5137
>                 URL: https://issues.apache.org/jira/browse/HBASE-5137
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.4
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan

[jira] [Commented] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183918#comment-13183918 ] 

Hudson commented on HBASE-5137:
-------------------------------

Integrated in HBase-0.92-security #71 (See [https://builds.apache.org/job/HBase-0.92-security/71/])
    HBASE-5137 MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException (Ram & Ted)

ramkrishna : 
Files : 
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java

                
> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-5137
>                 URL: https://issues.apache.org/jira/browse/HBASE-5137
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.4
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.0, 0.90.6
>
>         Attachments: 5137-trunk.txt, HBASE-5137.patch, HBASE-5137.patch

[jira] [Updated] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

Posted by "ramkrishna.s.vasudevan (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-5137:
------------------------------------------

    Fix Version/s: 0.90.6
                   0.92.1

Committed to 0.90 and trunk.  Do we need to commit in 0.92 also?
                
> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-5137
>                 URL: https://issues.apache.org/jira/browse/HBASE-5137
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.4
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.1, 0.90.6
>
>         Attachments: 5137-trunk.txt, HBASE-5137.patch, HBASE-5137.patch

[jira] [Updated] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

Posted by "ramkrishna.s.vasudevan (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-5137:
------------------------------------------

    Attachment: HBASE-5137.patch

Patch for 0.90 addressing Ted's comment about adding braces. It does not handle InterruptedException, though.
@Ted
Please check whether it is OK.
                
> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-5137
>                 URL: https://issues.apache.org/jira/browse/HBASE-5137
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.4
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>         Attachments: 5137-trunk.txt, HBASE-5137.patch, HBASE-5137.patch
>
>


[jira] [Commented] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182014#comment-13182014 ] 

Zhihong Yu commented on HBASE-5137:
-----------------------------------

Minor comment:
{code}
+        if (checkFileSystem() && retrySplitting)
+          LOG.info("Retrying failed log splitting " + logDir.toString());
+        else {
{code}
Please add braces around the log statement.
I think the above check should go into TRUNK as well (aborting in the case of not retrying).

Should we also handle InterruptedException, as TRUNK does?
                
> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-5137
>                 URL: https://issues.apache.org/jira/browse/HBASE-5137
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.4
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>         Attachments: HBASE-5137.patch
>
>


[jira] [Resolved] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

Posted by "ramkrishna.s.vasudevan (Resolved) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan resolved HBASE-5137.
-------------------------------------------

       Resolution: Fixed
    Fix Version/s:     (was: 0.92.1)
                   0.92.0
    
> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-5137
>                 URL: https://issues.apache.org/jira/browse/HBASE-5137
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.4
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.0, 0.90.6
>
>         Attachments: 5137-trunk.txt, HBASE-5137.patch, HBASE-5137.patch
>
>


[jira] [Commented] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183978#comment-13183978 ] 

Hudson commented on HBASE-5137:
-------------------------------

Integrated in HBase-TRUNK #2617 (See [https://builds.apache.org/job/HBase-TRUNK/2617/])
    HBASE-5137 MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException(Ram & Ted)

ramkrishna : 
Files : 
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java

                
> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-5137
>                 URL: https://issues.apache.org/jira/browse/HBASE-5137
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.4
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.0, 0.90.6
>
>         Attachments: 5137-trunk.txt, HBASE-5137.patch, HBASE-5137.patch
>
>


[jira] [Updated] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

Posted by "ramkrishna.s.vasudevan (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-5137:
------------------------------------------

    Attachment: HBASE-5137.patch
    
> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-5137
>                 URL: https://issues.apache.org/jira/browse/HBASE-5137
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.4
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>         Attachments: HBASE-5137.patch
>
>


[jira] [Issue Comment Edited] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

Posted by "Zhihong Yu (Issue Comment Edited) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182111#comment-13182111 ] 

Zhihong Yu edited comment on HBASE-5137 at 1/7/12 10:04 PM:
------------------------------------------------------------

Nicolas might know the reason for introducing hbase.hlog.split.failure.retry.interval parameter
                
      was (Author: zhihyu@ebaysf.com):
    Nicolas might know the reason for introducing hbase.hlog.split.failure.retry.interval parameter

Please provide a patch for 0.92 and TRUNK which adds check for retrySplitting in the following if statement (line 220):
{code}
        if (!checkFileSystem()) {
{code}
                  
> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-5137
>                 URL: https://issues.apache.org/jira/browse/HBASE-5137
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.4
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>         Attachments: 5137-trunk.txt, HBASE-5137.patch
>
>


[jira] [Commented] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

Posted by "ramkrishna.s.vasudevan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182365#comment-13182365 ] 

ramkrishna.s.vasudevan commented on HBASE-5137:
-----------------------------------------------

@Ted
Patch for trunk is fine. I still have a doubt about the sleep part. If the sleep is really needed, then the fix for trunk is needed.
                
> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-5137
>                 URL: https://issues.apache.org/jira/browse/HBASE-5137
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.4
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>         Attachments: 5137-trunk.txt, HBASE-5137.patch, HBASE-5137.patch
>
>


[jira] [Commented] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

Posted by "ramkrishna.s.vasudevan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181500#comment-13181500 ] 

ramkrishna.s.vasudevan commented on HBASE-5137:
-----------------------------------------------

@Ted
One more thing: we should abort even without checking the file system. If the check reports that the file system is fine, we don't abort, yet the log split has still not happened.
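
The decision being discussed can be sketched in isolation. This is only an illustration under stated assumptions, not the attached patch: `SplitStep`, `FsCheck`, `Outcome`, `splitOrAbort` and the `maxAttempts` bound are hypothetical names invented here, and the retry condition follows the `checkFileSystem() && retrySplitting` check quoted elsewhere in this thread. The idea is that a failed split is retried only when the file system looks healthy and retrying is enabled; in every other case the caller is told to abort instead of only logging the error.

```java
import java.io.IOException;

class SplitLogSketch {
  /** Hypothetical stand-in for the log-splitting step; may fail with IOException. */
  interface SplitStep { void run() throws IOException; }

  /** Hypothetical stand-in for a file system health check. */
  interface FsCheck { boolean isHealthy(); }

  /** Result reported to the caller so it can decide to abort. */
  enum Outcome { SPLIT_DONE, ABORTED }

  /**
   * Runs the split. On IOException it retries only if the file system is
   * healthy AND retrying is enabled; otherwise it reports ABORTED.
   * maxAttempts is an assumption added to keep the sketch bounded.
   */
  static Outcome splitOrAbort(SplitStep split, FsCheck fs,
                              boolean retrySplitting, int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        split.run();
        return Outcome.SPLIT_DONE;
      } catch (IOException e) {
        boolean canRetry = fs.isHealthy() && retrySplitting && attempt < maxAttempts;
        if (!canRetry) {
          // The split did not happen, so do not carry on silently.
          return Outcome.ABORTED;
        }
        // Otherwise fall through and try again.
      }
    }
    return Outcome.ABORTED;
  }
}
```

With retrying disabled, a healthy file system check no longer hides the failure: the first IOException yields `ABORTED`.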


                
> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-5137
>                 URL: https://issues.apache.org/jira/browse/HBASE-5137
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.4
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>


[jira] [Commented] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182716#comment-13182716 ] 

stack commented on HBASE-5137:
------------------------------

+1 on both patches.
                
> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-5137
>                 URL: https://issues.apache.org/jira/browse/HBASE-5137
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.4
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>         Attachments: 5137-trunk.txt, HBASE-5137.patch, HBASE-5137.patch
>
>


[jira] [Issue Comment Edited] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

Posted by "ramkrishna.s.vasudevan (Issue Comment Edited) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183424#comment-13183424 ] 

ramkrishna.s.vasudevan edited comment on HBASE-5137 at 1/10/12 5:54 PM:
------------------------------------------------------------------------

Committed to 0.92 also.
Thanks for the review Stack.
Thanks to Ted for the patch and review.
                
      was (Author: ram_krish):
    Committed to 0.92 also.
                  
> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-5137
>                 URL: https://issues.apache.org/jira/browse/HBASE-5137
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.4
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.0, 0.90.6
>
>         Attachments: 5137-trunk.txt, HBASE-5137.patch, HBASE-5137.patch
>
>


[jira] [Commented] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181624#comment-13181624 ] 

Zhihong Yu commented on HBASE-5137:
-----------------------------------

@Ram:
Good point. The current logic assumes that if the file system check passes, retrying the log split may succeed.

I think the correct logic should add an abort to the catch block below:
{code}
        } catch (InterruptedException e) {
          LOG.warn("Interrupted, aborting since cannot return w/o splitting at startup");
          Thread.currentThread().interrupt();
          retrySplitting = false;
          Runtime.getRuntime().halt(1);
        }
{code}
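
The interrupt handling in that catch block can be tried out separately. The sketch below is a hedged illustration, not code from MasterFileSystem: `InterruptSketch`, `Aborter` and `sleepBeforeRetry` are hypothetical names, and the abort is a callback rather than `Runtime.getRuntime().halt(1)` so the example can run without killing the JVM. It shows the two steps from the snippet above: restore the thread's interrupt flag, then stop retrying and abort.

```java
class InterruptSketch {
  /** Hypothetical hook standing in for whatever stops the master. */
  interface Aborter { void abort(String why); }

  /**
   * Sleeps before the next split attempt.
   * Returns true if the retry should go ahead, false if we were interrupted.
   */
  static boolean sleepBeforeRetry(long millis, Aborter aborter) {
    try {
      Thread.sleep(millis);
      return true;
    } catch (InterruptedException e) {
      // Thread.sleep clears the interrupt flag when it throws; set it again
      // so code further up the stack can still see the interruption.
      Thread.currentThread().interrupt();
      aborter.abort("Interrupted, aborting since cannot return w/o splitting");
      return false; // plays the role of retrySplitting = false
    }
  }
}
```

A caller would loop only while `sleepBeforeRetry` returns true, so an interrupt ends the retry loop instead of being swallowed.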
                
> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-5137
>                 URL: https://issues.apache.org/jira/browse/HBASE-5137
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.4
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>


[jira] [Commented] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183531#comment-13183531 ] 

Hudson commented on HBASE-5137:
-------------------------------

Integrated in HBase-0.92 #238 (See [https://builds.apache.org/job/HBase-0.92/238/])
    HBASE-5137    HBASE-5137  MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException (Ram & Ted)

ramkrishna : 
Files : 
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java

                
> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-5137
>                 URL: https://issues.apache.org/jira/browse/HBASE-5137
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.4
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.0, 0.90.6
>
>         Attachments: 5137-trunk.txt, HBASE-5137.patch, HBASE-5137.patch
>
>
