Posted to common-dev@hadoop.apache.org by "Jim Kellerman (JIRA)" <ji...@apache.org> on 2007/12/04 22:57:43 UTC

[jira] Commented: (HADOOP-2338) [hbase] NPE in master server

    [ https://issues.apache.org/jira/browse/HADOOP-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548427 ] 

Jim Kellerman commented on HADOOP-2338:
---------------------------------------

What happened:

The master starts up, successfully assigns the root and meta regions, and processes the responses.

While scanning the meta region, the master finds stale server info, splits the old logs, and assigns regions.

The first open report comes in and is queued up for the main thread. The main thread tries to process the report
but cannot, because the meta region has not been completely scanned. It then starves the rest of the master by
repeatedly putting the report back on the queue and taking it off again.

Needless to say, this slows down processing of the meta region considerably and prevents the master from responding to
heartbeat messages.

By the time the master completes its first scan of the meta table and starts its second pass, enough time has elapsed
that the master thinks the first server it assigned the region to either didn't get the message or died, so
the master assigns the region to another server (the one that can't get a reply back to the master for 67 seconds).

The master finally processes the original open response and updates the meta.

A while later, the master assigns the same region again to yet a third server. It updates the meta again with the third server's
information.

It is quite understandable that, in this chaos, the region would no longer appear in the assignAttempts map,
causing the NPE in the master.
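
To make the failure mode concrete, here is a minimal sketch of what happens at the marked line 1484 in the quoted code when the region's entry is already gone from the map: get() returns null and the call to longValue() throws. The class and method names below are hypothetical stand-ins keyed by String, not the actual HMaster code, and the null-safe variant only shows one way the symptom could be avoided; it does not address the root cause described above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for the master's bookkeeping.
// Assumption: regions are keyed by a String name and values are timestamps in ms.
class AssignAttemptsSketch {
  private final Map<String, Long> assignAttempts = new HashMap<String, Long>();
  private final long maxRegionOpenTime;

  AssignAttemptsSketch(long maxRegionOpenTime) {
    this.maxRegionOpenTime = maxRegionOpenTime;
  }

  // Record that a region was assigned at the given time.
  void recordAssignment(String regionName, long now) {
    synchronized (assignAttempts) {
      assignAttempts.put(regionName, Long.valueOf(now));
    }
  }

  // The region has been opened, so it is no longer a pending attempt.
  void regionOpened(String regionName) {
    synchronized (assignAttempts) {
      assignAttempts.remove(regionName);
    }
  }

  // Same shape as the quoted code: throws NullPointerException
  // when the region has no entry, because get() returns null.
  void extendUnchecked(String regionName) {
    synchronized (assignAttempts) {
      assignAttempts.put(regionName,
          Long.valueOf(assignAttempts.get(regionName).longValue()
              + (maxRegionOpenTime / 2)));
    }
  }

  // Null-safe variant: extends the open time only if an attempt is pending.
  // Returns false when the region is no longer in the map.
  boolean extendIfPending(String regionName) {
    synchronized (assignAttempts) {
      Long current = assignAttempts.get(regionName);
      if (current == null) {
        return false;
      }
      assignAttempts.put(regionName,
          Long.valueOf(current.longValue() + (maxRegionOpenTime / 2)));
      return true;
    }
  }

  // Returns the recorded time for the region, or null if none is pending.
  Long attemptTime(String regionName) {
    synchronized (assignAttempts) {
      return assignAttempts.get(regionName);
    }
  }
}
```

A late or duplicate open report for a region that has already been removed from the map would hit the unchecked path and fail in the same way as the stack trace in the issue description.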

Recommendations:

- Do not assign any user regions until all the meta regions have been scanned once.
- If we assume that message delivery is reliable, we don't need the assignAttempts map because if we don't hear
  back from the server we assign a region to, its lease will expire and we could reassign the region at that time.
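
The first recommendation could be sketched roughly as follows: user-region assignments are deferred until every meta region has been scanned once, and released afterwards. This is only an illustration under stated assumptions. AssignmentGateSketch and all of its methods are hypothetical names, regions are represented as plain strings, and "assigning" a region is reduced to adding it to a list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical gate that holds back user-region assignments until the
// initial scan of all meta regions has finished. Not the actual HMaster code.
class AssignmentGateSketch {
  private int metaRegionsToScan;                       // meta regions not yet scanned once
  private final List<String> deferred = new ArrayList<String>();
  private final List<String> assigned = new ArrayList<String>();

  AssignmentGateSketch(int metaRegionCount) {
    this.metaRegionsToScan = metaRegionCount;
  }

  synchronized boolean initialScanComplete() {
    return metaRegionsToScan == 0;
  }

  // Called each time the first scan of one meta region completes.
  // When the last one completes, all deferred assignments are released.
  synchronized void metaRegionScanned() {
    if (metaRegionsToScan > 0) {
      metaRegionsToScan--;
    }
    if (metaRegionsToScan == 0) {
      assigned.addAll(deferred);
      deferred.clear();
    }
  }

  // Assign immediately if the initial scan is done, otherwise defer.
  synchronized void requestAssignment(String regionName) {
    if (metaRegionsToScan == 0) {
      assigned.add(regionName);
    } else {
      deferred.add(regionName);
    }
  }

  // Snapshot of the regions that have actually been assigned so far.
  synchronized List<String> assignedRegions() {
    return new ArrayList<String>(assigned);
  }
}
```

With a gate like this, an open report could not arrive for a user region before the scan it depends on has completed, because no user region would have been handed out yet.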




> [hbase] NPE in master server
> ----------------------------
>
>                 Key: HADOOP-2338
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2338
>             Project: Hadoop
>          Issue Type: Bug
>          Components: contrib/hbase
>    Affects Versions: 0.16.0
>            Reporter: Jim Kellerman
>            Assignee: Jim Kellerman
>             Fix For: 0.16.0
>
>         Attachments: master.log.gz
>
>
> Master gets an NPE after receiving multiple responses from the same server telling the master it has opened a region.
> {code}
> 2007-12-02 20:31:37,515 DEBUG hbase.HRegion - Next sequence id for region postlog,img254/577/02suecia024richardburnson0.jpg,1196619667879 is 73377537
> 2007-12-02 20:31:37,517 INFO  hbase.HRegion - region postlog,img254/577/02suecia024richardburnson0.jpg,1196619667879 available
> 2007-12-02 20:31:39,200 WARN  hbase.HRegionServer - Processing message (Retry: 0)
> java.io.IOException: java.io.IOException: java.lang.NullPointerException
>     at org.apache.hadoop.hbase.HMaster.processMsgs(HMaster.java:1484)
>     at org.apache.hadoop.hbase.HMaster.regionServerReport(HMaster.java:1423)
>     at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:379)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:596)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>     at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:82)
>     at org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:48)
>     at org.apache.hadoop.hbase.HRegionServer.run(HRegionServer.java:759)
>     at java.lang.Thread.run(Thread.java:619)
>       case HMsg.MSG_REPORT_PROCESS_OPEN:
>         synchronized (this.assignAttempts) {
>           // Region server has acknowledged request to open region.
>           // Extend region open time by 1/2 max region open time.
> **1484**          assignAttempts.put(region.getRegionName(),
>               Long.valueOf(assignAttempts.get(
>                   region.getRegionName()).longValue() +
>                   (this.maxRegionOpenTime / 2)));
>         }
>         break;
> {code}
