You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@hbase.apache.org by "Jean-Daniel Cryans (Created) (JIRA)" <ji...@apache.org> on 2011/12/20 23:01:31 UTC

[jira] [Created] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS

SplitLogWorker fails to let go of a task, kills the RS
------------------------------------------------------

                 Key: HBASE-5077
                 URL: https://issues.apache.org/jira/browse/HBASE-5077
             Project: HBase
          Issue Type: Bug
    Affects Versions: 0.92.0
            Reporter: Jean-Daniel Cryans
            Priority: Critical
             Fix For: 0.92.1


I hope I didn't break spacetime continuum, I got this while testing 0.92.0:

{quote}
2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
        at java.lang.Thread.run(Thread.java:662)
{quote}

I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173682#comment-13173682 ] 

stack commented on HBASE-5077:
------------------------------

Chatting w/ J-D, we shouldn't return out of middle of finally -- should go through to end via the file closes.
                
> SplitLogWorker fails to let go of a task, kills the RS
> ------------------------------------------------------
>
>                 Key: HBASE-5077
>                 URL: https://issues.apache.org/jira/browse/HBASE-5077
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.92.1
>
>         Attachments: HBASE-5077.patch
>
>
> I hope I didn't break spacetime continuum, I got this while testing 0.92.0:
> {quote}
> 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
>         at java.lang.Thread.run(Thread.java:662)
> {quote}
> I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS

Posted by "Jean-Daniel Cryans (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173574#comment-13173574 ] 

Jean-Daniel Cryans commented on HBASE-5077:
-------------------------------------------

Won't exitWorker kill the SplitLogWorker fully? Like not just the task, but the RS will actually stop serving log splitting.
                
> SplitLogWorker fails to let go of a task, kills the RS
> ------------------------------------------------------
>
>                 Key: HBASE-5077
>                 URL: https://issues.apache.org/jira/browse/HBASE-5077
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.92.1
>
>
> I hope I didn't break spacetime continuum, I got this while testing 0.92.0:
> {quote}
> 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
>         at java.lang.Thread.run(Thread.java:662)
> {quote}
> I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS

Posted by "Jean-Daniel Cryans (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-5077:
--------------------------------------

    Status: Patch Available  (was: Open)
    
> SplitLogWorker fails to let go of a task, kills the RS
> ------------------------------------------------------
>
>                 Key: HBASE-5077
>                 URL: https://issues.apache.org/jira/browse/HBASE-5077
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.92.1
>
>         Attachments: HBASE-5077.patch
>
>
> I hope I didn't break spacetime continuum, I got this while testing 0.92.0:
> {quote}
> 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
>         at java.lang.Thread.run(Thread.java:662)
> {quote}
> I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173569#comment-13173569 ] 

Zhihong Yu commented on HBASE-5077:
-----------------------------------

After preemption log, the following code should have run:
{code}
  void stopTask() {
    LOG.info("Sending interrupt to stop the worker thread");
    worker.interrupt(); // TODO interrupt often gets swallowed, do what else?
  }
{code}
I think the following method should have been called instead:
{code}
  public void stop() {
    exitWorker = true;
    stopTask();
  }
{code}
                
> SplitLogWorker fails to let go of a task, kills the RS
> ------------------------------------------------------
>
>                 Key: HBASE-5077
>                 URL: https://issues.apache.org/jira/browse/HBASE-5077
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.92.1
>
>
> I hope I didn't break spacetime continuum, I got this while testing 0.92.0:
> {quote}
> 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
>         at java.lang.Thread.run(Thread.java:662)
> {quote}
> I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Assigned] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS

Posted by "Jean-Daniel Cryans (Assigned) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans reassigned HBASE-5077:
-----------------------------------------

    Assignee: Jean-Daniel Cryans
    
> SplitLogWorker fails to let go of a task, kills the RS
> ------------------------------------------------------
>
>                 Key: HBASE-5077
>                 URL: https://issues.apache.org/jira/browse/HBASE-5077
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.92.1
>
>         Attachments: HBASE-5077.patch
>
>
> I hope I didn't break spacetime continuum, I got this while testing 0.92.0:
> {quote}
> 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
>         at java.lang.Thread.run(Thread.java:662)
> {quote}
> I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173705#comment-13173705 ] 

Hadoop QA commented on HBASE-5077:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12508167/HBASE-5077-v2.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    -1 javadoc.  The javadoc tool appears to have generated -152 warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

     -1 core tests.  The patch failed these unit tests:
                       org.apache.hadoop.hbase.io.TestHeapSize

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/560//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/560//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/560//console

This message is automatically generated.
                
> SplitLogWorker fails to let go of a task, kills the RS
> ------------------------------------------------------
>
>                 Key: HBASE-5077
>                 URL: https://issues.apache.org/jira/browse/HBASE-5077
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.92.0
>
>         Attachments: HBASE-5077-v2.patch, HBASE-5077.patch
>
>
> I hope I didn't break spacetime continuum, I got this while testing 0.92.0:
> {quote}
> 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
>         at java.lang.Thread.run(Thread.java:662)
> {quote}
> I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Issue Comment Edited] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS

Posted by "Jean-Daniel Cryans (Issue Comment Edited) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173575#comment-13173575 ] 

Jean-Daniel Cryans edited comment on HBASE-5077 at 12/20/11 10:35 PM:
----------------------------------------------------------------------

One problem I see is that in HLogSplitter.splitLogFileToTemp we do this in the finally:

{code}
if ((progress_failed == false) && (reporter != null) &&
    (reporter.progress() == false)) {
  progress_failed = true;
}
{code}

But at this point progress_failed isn't taken into account so the method returns true. Looking at other parts of that method it seems it's missing a "return false" which would be correctly handled by SplitLogWorker.
                
      was (Author: jdcryans):
    One problem I see is that in HLogSplitter.splitLogFileToTemp we do this in the finally:

{quote}
if ((progress_failed == false) && (reporter != null) &&
    (reporter.progress() == false)) {
  progress_failed = true;
}
{quote}

But at this point progress_failed isn't taken into account so the method returns true. Looking at other parts of that method it seems it's missing a "return false" which would be correctly handled by SplitLogWorker.
                  
> SplitLogWorker fails to let go of a task, kills the RS
> ------------------------------------------------------
>
>                 Key: HBASE-5077
>                 URL: https://issues.apache.org/jira/browse/HBASE-5077
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.92.1
>
>
> I hope I didn't break spacetime continuum, I got this while testing 0.92.0:
> {quote}
> 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
>         at java.lang.Thread.run(Thread.java:662)
> {quote}
> I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS

Posted by "Jean-Daniel Cryans (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-5077:
--------------------------------------

    Attachment: HBASE-5077.patch

Adds the missing "return false" (I saw it was already fixed in 0.89-fb) and removed progress_failed since it doesn't do anything.
                
> SplitLogWorker fails to let go of a task, kills the RS
> ------------------------------------------------------
>
>                 Key: HBASE-5077
>                 URL: https://issues.apache.org/jira/browse/HBASE-5077
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.92.1
>
>         Attachments: HBASE-5077.patch
>
>
> I hope I didn't break spacetime continuum, I got this while testing 0.92.0:
> {quote}
> 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
>         at java.lang.Thread.run(Thread.java:662)
> {quote}
> I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS

Posted by "Jean-Daniel Cryans (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173548#comment-13173548 ] 

Jean-Daniel Cryans commented on HBASE-5077:
-------------------------------------------

This is from the master's POV:

{quote}
2011-12-20 02:59:42,086 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task at znode /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
2011-12-20 02:59:42,089 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 ver = 0
2011-12-20 02:59:42,113 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 acquired by sv4r13s38,62023,1324345934996
2011-12-20 03:00:09,244 INFO org.apache.hadoop.hbase.master.SplitLogManager: resubmitting task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
2011-12-20 03:00:09,302 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 ver = 3
2011-12-20 03:02:53,072 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 acquired by sv4r28s44,62023,1324345934970
2011-12-20 03:03:21,117 INFO org.apache.hadoop.hbase.master.SplitLogManager: resubmitting task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
2011-12-20 03:03:21,136 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 ver = 6
2011-12-20 03:04:40,421 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 acquired by sv4r6s38,62023,1324345935082
2011-12-20 03:05:09,133 INFO org.apache.hadoop.hbase.master.SplitLogManager: resubmitting task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
2011-12-20 03:05:09,144 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 ver = 9
2011-12-20 03:05:09,193 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 acquired by sv4r30s44,62023,1324345935039
2011-12-20 03:05:36,137 INFO org.apache.hadoop.hbase.master.SplitLogManager: Skipping resubmissions of task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 because threshold 3 reached
...
2011-12-20 03:05:47,139 INFO org.apache.hadoop.hbase.master.SplitLogManager: Skipping resubmissions of task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 because threshold 3 reached
2011-12-20 03:05:50,320 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 entered state done sv4r30s44,62023,1324345935039
{quote}

The one that died is sv4r6s38, the 3rd one to acquire the task. Here's its log:

{quote}
2011-12-20 03:04:40,418 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: worker sv4r6s38,62023,1324345935082 acquired task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
2011-12-20 03:04:43,574 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Path=hdfs://sv4r11s38:9100/hbase/splitlog/sv4r6s38,62023,1324345935082_hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814/TestTable/18010aed747eeb11d1a95427731bb136/recovered.edits/0000000000000031373.temp, syncFs=true, hflush=false
...
{quote}

Then it sees someone else takes the task but moves on:

{quote}
2011-12-20 03:05:09,182 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 preempted from sv4r6s38,62023,1324345935082, current task state and owner=unassigned sv4r11s38,62003,1324340331847
2011-12-20 03:05:09,194 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 preempted from sv4r6s38,62023,1324345935082, current task state and owner=owned sv4r30s44,62023,1324345935039
2011-12-20 03:05:10,841 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 preempted from sv4r6s38,62023,1324345935082, current task state and owner=owned sv4r30s44,62023,1324345935039
2011-12-20 03:05:15,089 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Path=hdfs://sv4r11s38:9100/hbase/splitlog/sv4r6s38,62023,1324345935082_hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814/TestTable/e710a55d352791b42ea6d588ed38e934/recovered.edits/0000000000000031401.temp, syncFs=true, hflush=false
{quote}

This happens another time until it fails to assert ownership of the znode:

{quote}
2011-12-20 03:06:17,256 WARN org.apache.hadoop.hbase.regionserver.SplitLogWorker: NONODE failed to assert ownership for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
2011-12-20 03:06:17,256 WARN org.apache.hadoop.hbase.regionserver.SplitLogWorker: Failed to heartbeat the task/hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
2011-12-20 03:06:17,302 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Closed hdfs://sv4r11s38:9100/hbase/splitlog/sv4r6s38,62023,1324345935082_hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814/TestTable/12c66fe6163486f5d1d1b6a2b7a5a474/recovered.edits/0000000000000031403.temp
...
{quote}

Like a honey badger it doesn't really care and continues closing files until it reaches a point where it dies with the stack posted in this jira's description.
                
> SplitLogWorker fails to let go of a task, kills the RS
> ------------------------------------------------------
>
>                 Key: HBASE-5077
>                 URL: https://issues.apache.org/jira/browse/HBASE-5077
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.92.1
>
>
> I hope I didn't break spacetime continuum, I got this while testing 0.92.0:
> {quote}
> 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
>         at java.lang.Thread.run(Thread.java:662)
> {quote}
> I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173981#comment-13173981 ] 

Hudson commented on HBASE-5077:
-------------------------------

Integrated in HBase-0.92-security #47 (See [https://builds.apache.org/job/HBase-0.92-security/47/])
    HBASE-5077 SplitLogWorker fails to let go of a task, kills the RS -- fix compile error
HBASE-5077 SplitLogWorker fails to let go of a task, kills the RS

stack : 
Files : 
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java

stack : 
Files : 
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java

                
> SplitLogWorker fails to let go of a task, kills the RS
> ------------------------------------------------------
>
>                 Key: HBASE-5077
>                 URL: https://issues.apache.org/jira/browse/HBASE-5077
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.92.0
>
>         Attachments: HBASE-5077-v2.patch, HBASE-5077-v4.txt, HBASE-5077.patch
>
>
> I hope I didn't break spacetime continuum, I got this while testing 0.92.0:
> {quote}
> 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
>         at java.lang.Thread.run(Thread.java:662)
> {quote}
> I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS

Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-5077:
-------------------------

    Attachment: HBASE-5077-v4.txt

This patch actually compiles.
                
> SplitLogWorker fails to let go of a task, kills the RS
> ------------------------------------------------------
>
>                 Key: HBASE-5077
>                 URL: https://issues.apache.org/jira/browse/HBASE-5077
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.92.0
>
>         Attachments: HBASE-5077-v2.patch, HBASE-5077-v4.txt, HBASE-5077.patch
>
>
> I hope I didn't break spacetime continuum, I got this while testing 0.92.0:
> {quote}
> 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
>         at java.lang.Thread.run(Thread.java:662)
> {quote}
> I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173577#comment-13173577 ] 

Zhihong Yu commented on HBASE-5077:
-----------------------------------

To answer J-D's question, let me reference the following code from taskLoop():
{code}
          } catch (InterruptedException e) {
            LOG.info("SplitLogWorker interrupted while waiting for task," +
              " exiting: " + e.toString());
            assert exitWorker == true;
            return;
          }
{code}
where exitWorker was expected to be true. I think the assertion wasn't triggered at runtime.
                
> SplitLogWorker fails to let go of a task, kills the RS
> ------------------------------------------------------
>
>                 Key: HBASE-5077
>                 URL: https://issues.apache.org/jira/browse/HBASE-5077
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.92.1
>
>
> I hope I didn't break spacetime continuum, I got this while testing 0.92.0:
> {quote}
> 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
>         at java.lang.Thread.run(Thread.java:662)
> {quote}
> I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174035#comment-13174035 ] 

Hudson commented on HBASE-5077:
-------------------------------

Integrated in HBase-TRUNK-security #39 (See [https://builds.apache.org/job/HBase-TRUNK-security/39/])
    HBASE-5077 SplitLogWorker fails to let go of a task, kills the RS -- fix compile error
HBASE-5077 SplitLogWorker fails to let go of a task, kills the RS

stack : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java

stack : 
Files : 
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java

                
> SplitLogWorker fails to let go of a task, kills the RS
> ------------------------------------------------------
>
>                 Key: HBASE-5077
>                 URL: https://issues.apache.org/jira/browse/HBASE-5077
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.92.0
>
>         Attachments: HBASE-5077-v2.patch, HBASE-5077-v4.txt, HBASE-5077.patch
>
>
> I hope I didn't break spacetime continuum, I got this while testing 0.92.0:
> {quote}
> 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
>         at java.lang.Thread.run(Thread.java:662)
> {quote}
> I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS

Posted by "Jean-Daniel Cryans (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-5077:
--------------------------------------

    Status: Open  (was: Patch Available)
    
> SplitLogWorker fails to let go of a task, kills the RS
> ------------------------------------------------------
>
>                 Key: HBASE-5077
>                 URL: https://issues.apache.org/jira/browse/HBASE-5077
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.92.1
>
>         Attachments: HBASE-5077-v2.patch, HBASE-5077.patch
>
>
> I hope I didn't break spacetime continuum, I got this while testing 0.92.0:
> {quote}
> 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
>         at java.lang.Thread.run(Thread.java:662)
> {quote}
> I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS

Posted by "Jean-Daniel Cryans (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173575#comment-13173575 ] 

Jean-Daniel Cryans commented on HBASE-5077:
-------------------------------------------

One problem I see is that in HLogSplitter.splitLogFileToTemp we do this in the finally:

{quote}
if ((progress_failed == false) && (reporter != null) &&
    (reporter.progress() == false)) {
  progress_failed = true;
}
{quote}

But at this point progress_failed isn't taken into account so the method returns true. Looking at other parts of that method it seems it's missing a "return false" which would be correctly handled by SplitLogWorker.
                
> SplitLogWorker fails to let go of a task, kills the RS
> ------------------------------------------------------
>
>                 Key: HBASE-5077
>                 URL: https://issues.apache.org/jira/browse/HBASE-5077
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.92.1
>
>
> I hope I didn't break spacetime continuum, I got this while testing 0.92.0:
> {quote}
> 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
>         at java.lang.Thread.run(Thread.java:662)
> {quote}
> I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173664#comment-13173664 ] 

Zhihong Yu commented on HBASE-5077:
-----------------------------------

Patch looks good.
                
> SplitLogWorker fails to let go of a task, kills the RS
> ------------------------------------------------------
>
>                 Key: HBASE-5077
>                 URL: https://issues.apache.org/jira/browse/HBASE-5077
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.92.1
>
>         Attachments: HBASE-5077.patch
>
>
> I hope I didn't break spacetime continuum, I got this while testing 0.92.0:
> {quote}
> 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
>         at java.lang.Thread.run(Thread.java:662)
> {quote}
> I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS

Posted by "Jean-Daniel Cryans (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-5077:
--------------------------------------

    Attachment: HBASE-5077-v2.patch

Forgot about the finally block, changing it again to basically add the correct return and print if process_failed is false.
                
> SplitLogWorker fails to let go of a task, kills the RS
> ------------------------------------------------------
>
>                 Key: HBASE-5077
>                 URL: https://issues.apache.org/jira/browse/HBASE-5077
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.92.1
>
>         Attachments: HBASE-5077-v2.patch, HBASE-5077.patch
>
>
> I hope I didn't break spacetime continuum, I got this while testing 0.92.0:
> {quote}
> 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
>         at java.lang.Thread.run(Thread.java:662)
> {quote}
> I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173685#comment-13173685 ] 

Hadoop QA commented on HBASE-5077:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12508163/HBASE-5077.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    -1 javadoc.  The javadoc tool appears to have generated -152 warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

     -1 core tests.  The patch failed these unit tests:
                       org.apache.hadoop.hbase.io.TestHeapSize

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/558//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/558//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/558//console

This message is automatically generated.
                
> SplitLogWorker fails to let go of a task, kills the RS
> ------------------------------------------------------
>
>                 Key: HBASE-5077
>                 URL: https://issues.apache.org/jira/browse/HBASE-5077
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.92.1
>
>         Attachments: HBASE-5077.patch
>
>
> I hope I didn't break spacetime continuum, I got this while testing 0.92.0:
> {quote}
> 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
>         at java.lang.Thread.run(Thread.java:662)
> {quote}
> I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS

Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-5077:
-------------------------

      Resolution: Fixed
    Hadoop Flags: Reviewed
          Status: Resolved  (was: Patch Available)

Committed branch and trunk.
                
> SplitLogWorker fails to let go of a task, kills the RS
> ------------------------------------------------------
>
>                 Key: HBASE-5077
>                 URL: https://issues.apache.org/jira/browse/HBASE-5077
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.92.0
>
>         Attachments: HBASE-5077-v2.patch, HBASE-5077.patch
>
>
> I hope I didn't break spacetime continuum, I got this while testing 0.92.0:
> {quote}
> 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
>         at java.lang.Thread.run(Thread.java:662)
> {quote}
> I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS

Posted by "Jean-Daniel Cryans (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-5077:
--------------------------------------

    Fix Version/s:     (was: 0.92.1)
                   0.92.0
           Status: Patch Available  (was: Open)
    
> SplitLogWorker fails to let go of a task, kills the RS
> ------------------------------------------------------
>
>                 Key: HBASE-5077
>                 URL: https://issues.apache.org/jira/browse/HBASE-5077
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.92.0
>
>         Attachments: HBASE-5077-v2.patch, HBASE-5077.patch
>
>
> I hope I didn't break spacetime continuum, I got this while testing 0.92.0:
> {quote}
> 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
>         at java.lang.Thread.run(Thread.java:662)
> {quote}
> I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173879#comment-13173879 ] 

Hudson commented on HBASE-5077:
-------------------------------

Integrated in HBase-TRUNK #2565 (See [https://builds.apache.org/job/HBase-TRUNK/2565/])
    HBASE-5077 SplitLogWorker fails to let go of a task, kills the RS -- fix compile error
HBASE-5077 SplitLogWorker fails to let go of a task, kills the RS

stack : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java

stack : 
Files : 
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java

                
> SplitLogWorker fails to let go of a task, kills the RS
> ------------------------------------------------------
>
>                 Key: HBASE-5077
>                 URL: https://issues.apache.org/jira/browse/HBASE-5077
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.92.0
>
>         Attachments: HBASE-5077-v2.patch, HBASE-5077-v4.txt, HBASE-5077.patch
>
>
> I hope I didn't break spacetime continuum, I got this while testing 0.92.0:
> {quote}
> 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
>         at java.lang.Thread.run(Thread.java:662)
> {quote}
> I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173832#comment-13173832 ] 

stack commented on HBASE-5077:
------------------------------

+1 on patch for branch and trunk.
                
> SplitLogWorker fails to let go of a task, kills the RS
> ------------------------------------------------------
>
>                 Key: HBASE-5077
>                 URL: https://issues.apache.org/jira/browse/HBASE-5077
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.92.0
>
>         Attachments: HBASE-5077-v2.patch, HBASE-5077.patch
>
>
> I hope I didn't break spacetime continuum, I got this while testing 0.92.0:
> {quote}
> 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
>         at java.lang.Thread.run(Thread.java:662)
> {quote}
> I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173933#comment-13173933 ] 

Hudson commented on HBASE-5077:
-------------------------------

Integrated in HBase-0.92 #205 (See [https://builds.apache.org/job/HBase-0.92/205/])
    HBASE-5077 SplitLogWorker fails to let go of a task, kills the RS -- fix compile error
HBASE-5077 SplitLogWorker fails to let go of a task, kills the RS

stack : 
Files : 
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java

stack : 
Files : 
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java

                
> SplitLogWorker fails to let go of a task, kills the RS
> ------------------------------------------------------
>
>                 Key: HBASE-5077
>                 URL: https://issues.apache.org/jira/browse/HBASE-5077
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.92.0
>
>         Attachments: HBASE-5077-v2.patch, HBASE-5077-v4.txt, HBASE-5077.patch
>
>
> I hope I didn't break spacetime continuum, I got this while testing 0.92.0:
> {quote}
> 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
>         at java.lang.Thread.run(Thread.java:662)
> {quote}
> I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS

Posted by "Jean-Daniel Cryans (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173592#comment-13173592 ] 

Jean-Daniel Cryans commented on HBASE-5077:
-------------------------------------------

I now understand how I got all the way to closing the files without aborting the splitting, the interrupt is being retried by the DFSClient:

{quote}

2011-12-20 03:05:09,194 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 preempted from sv4r6s38,62023,1324345935082, current task state and owner=owned sv4r30s44,62023,1324345935039
2011-12-20 03:05:09,194 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: Sending interrupt to stop the worker thread
2011-12-20 03:05:09,214 INFO org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.4.28.44:51010, add to deadNodes and continue
java.nio.channels.ClosedByInterruptException
        at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:184)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:511)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.getBlockReader(DFSClient.java:2354)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:2033)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.seekToBlockSource(DFSClient.java:2483)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:2119)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2150)
        at java.io.DataInputStream.read(DataInputStream.java:132)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1945)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1845)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1891)
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:198)
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:172)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getNextLogLine(HLogSplitter.java:764)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:402)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:351)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:266)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
        at java.lang.Thread.run(Thread.java:662)
2011-12-20 03:05:09,216 INFO org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.4.12.38:51010, add to deadNodes and continue
java.nio.channels.ClosedByInterruptException
        at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:184)
...
2011-12-20 03:05:09,220 INFO org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.4.14.38:51010, add to deadNodes and continue
java.nio.channels.ClosedByInterruptException
        at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:184)
...
2011-12-20 03:05:09,223 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_2118163224139708562_43382 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
{quote}
                
> SplitLogWorker fails to let go of a task, kills the RS
> ------------------------------------------------------
>
>                 Key: HBASE-5077
>                 URL: https://issues.apache.org/jira/browse/HBASE-5077
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.92.1
>
>
> I hope I didn't break spacetime continuum, I got this while testing 0.92.0:
> {quote}
> 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
>         at java.lang.Thread.run(Thread.java:662)
> {quote}
> I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira