Posted to issues@hbase.apache.org by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org> on 2010/06/10 20:47:13 UTC

[jira] Created: (HBASE-2707) Can't recover from a dead ROOT server if any exception happens during log splitting

Can't recover from a dead ROOT server if any exception happens during log splitting
-----------------------------------------------------------------------------------

                 Key: HBASE-2707
                 URL: https://issues.apache.org/jira/browse/HBASE-2707
             Project: HBase
          Issue Type: Bug
            Reporter: Jean-Daniel Cryans
            Assignee: Jean-Daniel Cryans
            Priority: Blocker
             Fix For: 0.21.0


There's an all-too-easy way to get stuck after the RS holding ROOT dies, usually from a GC-like event. It happens frequently in my TestReplication test from HBASE-2223.

Some logs:

{code}
2010-06-10 11:35:52,090 INFO  [master] wal.HLog(1175): Spliting is done. Removing old log dir hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
2010-06-10 11:35:52,095 WARN  [master] master.RegionServerOperationQueue(183): Failed processing: ProcessServerShutdown of 10.10.1.63,55846,1276194933831; putting onto delayed todo queue
java.io.IOException: Cannot delete: hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
        at org.apache.hadoop.hbase.regionserver.wal.HLog.splitLog(HLog.java:1179)
        at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:298)
        at org.apache.hadoop.hbase.master.RegionServerOperationQueue.process(RegionServerOperationQueue.java:149)
        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:456)
Caused by: java.io.IOException: java.io.IOException: /user/jdcryans/.logs/10.10.1.63,55846,1276194933831 is non empty
2010-06-10 11:35:52,097 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
2010-06-10 11:35:53,098 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
2010-06-10 11:35:53,523 INFO  [main.serverMonitor] master.ServerManager$ServerMonitor(131): 1 region servers, 1 dead, average load 14.0[10.10.1.63,55846,1276194933831]
2010-06-10 11:35:54,099 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
2010-06-10 11:35:55,101 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
{code}

The last lines are my own debug output. Since we don't process the delayed todo queue while ROOT isn't online, we'll never reassign the regions.
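
To make the cycle concrete, here is a minimal, self-contained sketch of the deadlock (simplified names and structure, not the actual HBase master code):

{code}
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Queue;

public class DelayedQueueDeadlock {
  interface RegionServerOperation { void process() throws IOException; }

  static boolean rootOnline = false;   // the RS carrying -ROOT- just died
  static Queue<RegionServerOperation> toDo =
      new ArrayDeque<RegionServerOperation>();
  static Queue<RegionServerOperation> delayedToDo =
      new ArrayDeque<RegionServerOperation>();

  public static void main(String[] args) {
    // ProcessServerShutdown for the server that carried -ROOT-.  Log
    // splitting fails (e.g. "Cannot delete: ... is non empty") before
    // the step that would have reassigned -ROOT-.
    toDo.add(new RegionServerOperation() {
      public void process() throws IOException {
        throw new IOException("Cannot delete log dir");
      }
    });
    for (int i = 0; i < 5; i++) {      // the master's main loop
      if (rootOnline) {
        toDo.addAll(delayedToDo);      // delayed ops only re-queued here
        delayedToDo.clear();
      } else if (!delayedToDo.isEmpty()) {
        System.out.println("-ROOT- isn't online, can't process delayedToDoQueue items");
        continue;
      }
      RegionServerOperation op = toDo.poll();
      if (op == null) continue;
      try {
        op.process();
      } catch (IOException e) {
        delayedToDo.add(op);           // parked: but this is the only op
      }                                // that could bring -ROOT- back
    }
  }
}
{code}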



[jira] Updated: (HBASE-2707) Can't recover from a dead ROOT server if any exception happens during log splitting

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-2707:
-------------------------

    Attachment: 2707-0.20.txt

Backport to 0.20.  Please review.  I'd like this to go into 0.20 since it hoses the cluster if it ever happens.



[jira] Commented: (HBASE-2707) Can't recover from a dead ROOT server if any exception happens during log splitting

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877705#action_12877705 ] 

stack commented on HBASE-2707:
------------------------------

I think we need a test for this, because I can't see how -ROOT- is recovered if the code in ProcessServerShutdown#process is like this:

{code}
    if (!rootRescanned) {
      // Scan the ROOT region
      Boolean result = new ScanRootRegion(
          new MetaRegion(master.getRegionManager().getRootRegionLocation(),
              HRegionInfo.ROOT_REGIONINFO), this.master).doWithRetries();
      if (result == null) {
        // Master is closing - give up
        return true;
      }

      if (LOG.isDebugEnabled()) {
        LOG.debug("Process server shutdown scanning root region on " +
          master.getRegionManager().getRootRegionLocation().getBindAddress() +
          " finished " + Thread.currentThread().getName());
      }
      rootRescanned = true;
    }
{code}

If the server being processed held -ROOT-, and the first thing we do when processing the shutdown of a server that held root is clear the root location, how is the above code succeeding?

Assigning myself to mess w/ a test.



[jira] Commented: (HBASE-2707) Can't recover from a dead ROOT server if any exception happens during log splitting

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882638#action_12882638 ] 

Jean-Daniel Cryans commented on HBASE-2707:
-------------------------------------------

So actually the code of process() looks like this:

{code}
    LOG.info("Log split complete, meta reassignment and scanning:");
    if (this.isRootServer) {
      LOG.info("ProcessServerShutdown reassigning ROOT region");
      master.getRegionManager().reassignRootRegion();
      isRootServer = false;  // prevent double reassignment... heh.
    }

    for (MetaRegion metaRegion : metaRegions) {
      LOG.info("ProcessServerShutdown setting to unassigned: " + metaRegion.toString());
      master.getRegionManager().setUnassigned(metaRegion.getRegionInfo(), true);
    }
    // once the meta regions are online, "forget" about them.  Since there are explicit
    // checks below to make sure meta/root are online, this is likely to occur.
    metaRegions.clear();

    if (!rootAvailable()) {
      // Return true so that worker does not put this request back on the
      // toDoQueue.
      // rootAvailable() has already put it on the delayedToDoQueue
      return true;
    }

    if (!rootRescanned) {
      // Scan the ROOT region
      Boolean result = new ScanRootRegion(
          new MetaRegion(master.getRegionManager().getRootRegionLocation(),
              HRegionInfo.ROOT_REGIONINFO), this.master).doWithRetries();
      if (result == null) {
        // Master is closing - give up
        return true;
      }

      if (LOG.isDebugEnabled()) {
        LOG.debug("Process server shutdown scanning root region on " +
          master.getRegionManager().getRootRegionLocation().getBindAddress() +
          " finished " + Thread.currentThread().getName());
      }
      rootRescanned = true;
    }
{code}

So if the RS had -ROOT-, it gets reassigned right away, and then the method returns early if !rootAvailable(). Later, when we come back and root has been assigned, ProcessServerShutdown finishes its job. This is how the code you pasted succeeds.
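
For context, rootAvailable() presumably does something like the following (a sketch inferred from the comments in the code above; the accessor and queue method names here are assumptions, not the actual source):

{code}
// Sketch only, inferred from "rootAvailable() has already put it on
// the delayedToDoQueue" above.  Method names are assumptions.
private boolean rootAvailable() {
  if (master.getRegionManager().getRootRegionLocation() != null) {
    return true;   // -ROOT- is online, processing can continue
  }
  // Park this operation until -ROOT- has been reassigned.  The caller
  // then returns true so the worker doesn't also re-queue it on toDoQueue.
  master.getRegionServerOperationQueue().putOnDelayedQueue(this);
  return false;
}
{code}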



[jira] Updated: (HBASE-2707) Can't recover from a dead ROOT server if any exception happens during log splitting

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-2707:
-------------------------

    Attachment: 2707-test.txt

Test that puts off the processing of the shutdown of the server that was carrying root.  Without the patch this test never completes; with the patch in place, it does.
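
The attachment itself isn't reproduced in this thread, but the rough shape of such a test is below (an outline only; the class name and steps are hypothetical, not the contents of 2707-test.txt):

{code}
import org.junit.Test;

public class TestRootServerShutdownRecovery {
  @Test
  public void testRecoveryWhenRootShutdownIsDelayed() throws Exception {
    // 1. Start a mini cluster and locate the RS carrying -ROOT-.
    // 2. Kill that RS, arranging for the master's first
    //    ProcessServerShutdown attempt to be put off (e.g. log
    //    splitting fails), so the op lands on the delayedToDoQueue.
    // 3. Without the fix, -ROOT- is never reassigned and this test
    //    hangs; with the fix, the cluster recovers and the regions
    //    come back online.
  }
}
{code}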



[jira] Updated: (HBASE-2707) Can't recover from a dead ROOT server if any exception happens during log splitting

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-2707:
--------------------------------------

    Attachment: HBASE-2707.patch

Stack and I talked about this a lot; here's what we came up with. It's very hard to write a unit test for it since it's all deep in the master and very much time-based, but I ran TestReplication against the patch a lot and 1) it doesn't fail anymore, and 2) I can see in the logs that the master does the right thing.

Should I commit this?



[jira] Commented: (HBASE-2707) Can't recover from a dead ROOT server if any exception happens during log splitting

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882653#action_12882653 ] 

stack commented on HBASE-2707:
------------------------------

Hmm... chatted with J-D and he points out that the above runs AFTER the logs are split, so I had it wrong.  The above should be good.



[jira] Assigned: (HBASE-2707) Can't recover from a dead ROOT server if any exception happens during log splitting

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack reassigned HBASE-2707:
----------------------------

    Assignee: stack  (was: Jean-Daniel Cryans)



[jira] Updated: (HBASE-2707) Can't recover from a dead ROOT server if any exception happens during log splitting

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-2707:
-------------------------

    Attachment:     (was: 2707-v3.txt)



[jira] Updated: (HBASE-2707) Can't recover from a dead ROOT server if any exception happens during log splitting

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-2707:
-------------------------

    Attachment: 2707-v3.txt

Latest iteration.  It turns out that what is currently in place is broken.  More on this later.



[jira] Updated: (HBASE-2707) Can't recover from a dead ROOT server if any exception happens during log splitting

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-2707:
-------------------------

    Fix Version/s:     (was: 0.20.6)

Taking this out of 0.20.6; I'll open a separate issue.  The attached 0.20 patch is not enough: the change would be more major than this patch presumes.



[jira] Updated: (HBASE-2707) Can't recover from a dead ROOT server if any exception happens during log splitting

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-2707:
-------------------------

    Fix Version/s: 0.20.6

Moving into 0.20.6.



[jira] Commented: (HBASE-2707) Can't recover from a dead ROOT server if any exception happens during log splitting

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877706#action_12877706 ] 

stack commented on HBASE-2707:
------------------------------

Oh, regarding the patch: I think it's definitely an improvement over what was there before (what was there before was silly; it made this bug that J-D filed).



[jira] Commented: (HBASE-2707) Can't recover from a dead ROOT server if any exception happens during log splitting

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882776#action_12882776 ] 

Jean-Daniel Cryans commented on HBASE-2707:
-------------------------------------------

+1. One minor nit: why isn't DELAY private?
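
(For illustration only; the actual declaration in the patch may differ. The nit is about a constant left with wider visibility than it needs:)

{code}
// As reviewed: package-private, visible to the whole package.
static final int DELAY = 1000;
// What the nit suggests: visible only to the declaring class.
// private static final int DELAY = 1000;
{code}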



[jira] Commented: (HBASE-2707) Can't recover from a dead ROOT server if any exception happens during log splitting

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882649#action_12882649 ] 

stack commented on HBASE-2707:
------------------------------

So it's broken then?  We assign -ROOT- but don't recover its edits?



[jira] Resolved: (HBASE-2707) Can't recover from a dead ROOT server if any exception happens during log splitting

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-2707.
--------------------------

    Hadoop Flags: [Reviewed]
      Resolution: Fixed

Committed.  Thanks for review J-D (I removed DELAY altogether).
