Posted to dev@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2010/04/30 09:03:54 UTC

[jira] Resolved: (HBASE-2428) NPE in ProcessRegionClose because meta is offline kills master and thus the cluster

     [ https://issues.apache.org/jira/browse/HBASE-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-2428.
--------------------------

      Assignee: stack
    Resolution: Fixed

The patch committed over in hbase-2414 included fixes for this issue and a unit test.  That patch is big, so here are the two pieces of it that specifically fixed this issue:

In the HMaster todo queue processing, when op.process failed, we'd fall into this block:
{code}
+    } catch (Exception ex) {
+      // There was an exception performing the operation.
+      if (ex instanceof RemoteException) {
+        try {
+          ex = RemoteExceptionHandler.decodeRemoteException(
+            (RemoteException)ex);
+        } catch (IOException e) {
+          ex = e;
+          LOG.warn("main processing loop: " + op.toString(), e);
+        }
+      }
+      LOG.warn("Failed processing: " + op.toString() +
+        "; putting onto delayed todo queue", ex);
+      putOnDelayQueue(op);
...
{code}

The fix is the new method putOnDelayQueue(op).  Previously we just added the operation directly to the delay queue.  The new method first resets the RegionServerOperation expiration to some time in the future.  Before, the expiration was not reset, so the operation did not stay in the delay queue; it came straight back out.
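
To illustrate why the reset matters, here is a minimal sketch, assuming the delay queue behaves like java.util.concurrent.DelayQueue.  The Op class and both helper methods are hypothetical stand-ins, not the HBase RegionServerOperation or the real putOnDelayQueue.  An element whose expiration is already in the past is handed back by poll() immediately, while one whose expiration was reset stays queued.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Hypothetical model only; names here are illustrative, not taken from HBase.
class DelayedOpSketch {

  // Stand-in for an operation that carries its own expiration time.
  static final class Op implements Delayed {
    private long expireAtMillis;

    Op(long expireAtMillis) {
      this.expireAtMillis = expireAtMillis;
    }

    // Push the expiration into the future before the op is re-queued.
    void resetExpiration(long delayMillis) {
      this.expireAtMillis = System.currentTimeMillis() + delayMillis;
    }

    @Override
    public long getDelay(TimeUnit unit) {
      long remaining = expireAtMillis - System.currentTimeMillis();
      return unit.convert(remaining, TimeUnit.MILLISECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
      return Long.compare(getDelay(TimeUnit.MILLISECONDS),
          other.getDelay(TimeUnit.MILLISECONDS));
    }
  }

  // Old behavior: the op is added with a stale expiration, so the
  // queue considers it expired and poll() returns it right away.
  static boolean comesStraightOutWithoutReset() {
    DelayQueue<Op> queue = new DelayQueue<>();
    Op op = new Op(System.currentTimeMillis() - 1);
    queue.add(op);
    return queue.poll() == op;
  }

  // New behavior: the expiration is reset first, so poll() finds
  // nothing expired and the op remains in the queue.
  static boolean staysQueuedAfterReset() {
    DelayQueue<Op> queue = new DelayQueue<>();
    Op op = new Op(System.currentTimeMillis() - 1);
    op.resetExpiration(60_000L);
    queue.add(op);
    return queue.poll() == null && queue.size() == 1;
  }

  public static void main(String[] args) {
    System.out.println("without reset, comes straight out: "
        + comesStraightOutWithoutReset());
    System.out.println("after reset, stays queued: "
        + staysQueuedAfterReset());
  }
}
```

The 60 second delay is an arbitrary value chosen for the sketch; the issue does not state how far into the future the real expiration is set.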

Here is the part of the fix that got rid of the NPEs:
{code}
Index: src/java/org/apache/hadoop/hbase/master/ProcessRegionClose.java
===================================================================
--- src/java/org/apache/hadoop/hbase/master/ProcessRegionClose.java	(revision 939172)
+++ src/java/org/apache/hadoop/hbase/master/ProcessRegionClose.java	(working copy)
@@ -58,6 +58,13 @@
 
   @Override
   protected boolean process() throws IOException {
+    if (!metaRegionAvailable()) {
+      // We can't proceed unless the meta region we are going to update
+      // is online. metaRegionAvailable() has put this operation on the
+      // delayedToDoQueue, so return true so the operation is not put
+      // back on the toDoQueue
+      return true;
+    }
     Boolean result = null;
     if (offlineRegion || reassignRegion) {
       result =
{code}
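
The control flow of the guard in the diff above can be sketched as follows.  This is a simplified model under stated assumptions: only process() and metaRegionAvailable() are names from the patch; the String operations, the plain queues, the metaOnline flag and runOnce() are hypothetical, and the real metaRegionAvailable() takes no argument.  It shows why process() returns true when meta is offline: the operation has already been parked on the delayed queue, so the caller must not also put it back on the todo queue.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical model of the guard; not the HBase implementation.
class MetaGuardSketch {
  final Queue<String> toDoQueue = new ArrayDeque<>();
  final Queue<String> delayedToDoQueue = new ArrayDeque<>();
  boolean metaOnline = false;

  // Assumed behavior, following the patch comment: if meta is offline,
  // park the operation on the delayed queue and report "not available".
  boolean metaRegionAvailable(String op) {
    if (!metaOnline) {
      delayedToDoQueue.add(op);
      return false;
    }
    return true;
  }

  // Returns true when the caller should NOT requeue the operation.
  boolean process(String op) {
    if (!metaRegionAvailable(op)) {
      // Already on the delayed queue; returning true avoids a second
      // copy landing on the todo queue.
      return true;
    }
    // The meta update would happen here once meta is online.
    return true;
  }

  // Caller side: requeue on the todo queue only when process() says so.
  void runOnce(String op) {
    if (!process(op)) {
      toDoQueue.add(op);
    }
  }

  public static void main(String[] args) {
    MetaGuardSketch master = new MetaGuardSketch();
    master.runOnce("ProcessRegionClose");
    System.out.println("delayed: " + master.delayedToDoQueue
        + ", todo: " + master.toDoQueue);
  }
}
```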

> NPE in ProcessRegionClose because meta is offline kills master and thus the cluster
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-2428
>                 URL: https://issues.apache.org/jira/browse/HBASE-2428
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 0.20.5
>
>
> This issue was born of study done in hbase-2413.  The meta went offline while we were processing a region close.  The close processing fell into an NPE loop and wouldn't get out of it, killing the master and effectively killing the cluster:
> {code}
> 2010-03-31 17:50:57,004 INFO org.apache.hadoop.hbase.master.ServerManager: hbasetest020.X.X.X,60020,1270077892989 znode expired
> 2010-03-31 17:50:57,004 INFO org.apache.hadoop.hbase.master.RegionManager: META region removed from onlineMetaRegions
> 2010-03-31 17:51:15,385 INFO org.apache.hadoop.hbase.master.ServerManager: Received start message from: hbasetest020.X.X.X,60020,1270083075377
> 2010-03-31 17:51:15,399 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Updated ZNode /hbase/rs/1270083075377 with data 10.18.35.215:60020
> 2010-03-31 17:51:15,870 DEBUG org.apache.hadoop.hbase.master.RegionManager: Server is overloaded: load=5, avg=3.0, slop=0.3
> 2010-03-31 17:51:15,870 DEBUG org.apache.hadoop.hbase.master.RegionManager: Choosing to reassign 2 regions. mostLoadedRegions has 5 regions in it.
> 2010-03-31 17:51:15,870 DEBUG org.apache.hadoop.hbase.master.RegionManager: Going to close region test1,3147000000,1270081876965
> 2010-03-31 17:51:15,870 DEBUG org.apache.hadoop.hbase.master.RegionManager: Going to close region test1,9352000000,1270080893514
> 2010-03-31 17:51:15,870 INFO org.apache.hadoop.hbase.master.RegionManager: Skipped 0 region(s) that are in transition states
> 2010-03-31 17:51:15,878 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_CLOSE: test1,3147000000,1270081876965 from hbasetest019.X.X.X,60020,1270082983630; 1 of 2
> 2010-03-31 17:51:15,879 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_CLOSE: test1,9352000000,1270080893514 from hbasetest019.X.X.X,60020,1270082983630; 2 of 2
> 2010-03-31 17:51:44,897 DEBUG org.apache.hadoop.hbase.master.ProcessRegionClose$1: Trying to contact region server for regionName '.META.,,1', but failed after 10 attempts.
> Exception 1:
> java.io.IOException: Call to /10.18.35.215:60020 failed on local exception: java.io.EOFExceptionException 1:
> java.net.ConnectException: Connection refusedException 1:
> java.net.ConnectException: Connection refusedException 1:
> java.net.ConnectException: Connection refusedException 1:
> java.net.ConnectException: Connection refusedException 1:
> java.net.ConnectException: Connection refusedException 1:
> java.net.ConnectException: Connection refusedException 1:
> org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: .META.,,1
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2282)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.delete(HRegionServer.java:1989)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:577)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> Exception 1:
> org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: .META.,,1
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2282)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.delete(HRegionServer.java:1989)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:577)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: .META.,,1
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2282)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.delete(HRegionServer.java:1989)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:577)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>         at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
>         at org.apache.hadoop.hbase.master.RetryableMetaOperation.doWithRetries(RetryableMetaOperation.java:74)
>         at org.apache.hadoop.hbase.master.ProcessRegionClose.process(ProcessRegionClose.java:63)
>         at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:494)
>         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:429)
> 2010-03-31 17:51:44,899 DEBUG org.apache.hadoop.hbase.master.HMaster: Processing todo: ProcessRegionClose of test1,1230000000,1270081673808, false, reassign: true
> 2010-03-31 17:51:44,899 DEBUG org.apache.hadoop.hbase.master.ProcessRegionClose$1: Exception in RetryableMetaOperation:
> java.lang.NullPointerException
>         at org.apache.hadoop.hbase.master.RetryableMetaOperation.doWithRetries(RetryableMetaOperation.java:64)
>         at org.apache.hadoop.hbase.master.ProcessRegionClose.process(ProcessRegionClose.java:63)
>         at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:494)
>         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:429)
> 2010-03-31 17:51:44,900 WARN org.apache.hadoop.hbase.master.HMaster: Processing pending operations: ProcessRegionClose of test1,1230000000,1270081673808, false, reassign: true
> java.lang.RuntimeException: java.lang.NullPointerException
>         at org.apache.hadoop.hbase.master.RetryableMetaOperation.doWithRetries(RetryableMetaOperation.java:96)
>         at org.apache.hadoop.hbase.master.ProcessRegionClose.process(ProcessRegionClose.java:63)
>         at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:494)
>         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:429)
> Caused by: java.lang.NullPointerException
>         at org.apache.hadoop.hbase.master.RetryableMetaOperation.doWithRetries(RetryableMetaOperation.java:64)
>         ... 3 more
> {code}
> ... and so on.
> Marking a blocker for 0.20.5.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.