Posted to issues@hbase.apache.org by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org> on 2011/05/25 22:39:47 UTC

[jira] [Assigned] (HBASE-3874) ServerShutdownHandler fails on NPE if a plan has a random region assignment

     [ https://issues.apache.org/jira/browse/HBASE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans reassigned HBASE-3874:
-----------------------------------------

    Assignee: Jean-Daniel Cryans

> ServerShutdownHandler fails on NPE if a plan has a random region assignment
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-3874
>                 URL: https://issues.apache.org/jira/browse/HBASE-3874
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.2
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3874-trunk.patch, HBASE-3874.patch
>
>
> By chance, the ulimit on one of our clusters was reverted to 1024 and the cluster started dying non-stop with "Too many open files" errors. The bad part is that some region servers were never fully processed by ServerShutdownHandler because it failed on:
> {quote}
> 2011-05-07 00:04:46,203 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN
> java.lang.NullPointerException
> 	at org.apache.hadoop.hbase.master.AssignmentManager.processServerShutdown(AssignmentManager.java:1804)
> 	at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:101)
> 	at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> 	at java.lang.Thread.run(Thread.java:662)
> {quote}
> Reading the code, it seems the NPE is in the if statement:
> {code}
> Map.Entry<String, RegionPlan> e = i.next();
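> // NPE here: getDestination() can return null for a random-assignment plan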
> if (e.getValue().getDestination().equals(hsi)) {
>   // Use iterator's remove else we'll get CME
>   i.remove();
> }
> {code}
> This means that the destination (HSI) is null. Looking through the code, it seems we instantiate a RegionPlan with a null HSI when the plan is a random assignment.
> So if a random assignment is in flight while a node dies, this issue can happen.
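> A simple way to guard against this would be a null check on the destination before the equals() call, along these lines (just a sketch assuming the 0.90 HServerInfo type, not necessarily what the attached patches do):
> {code}
> Map.Entry<String, RegionPlan> e = i.next();
> // A random-assignment plan has no destination yet, so only remove the
> // plan when a destination exists and it matches the dead server.
> HServerInfo dest = e.getValue().getDestination();
> if (dest != null && dest.equals(hsi)) {
>   // Use iterator's remove else we'll get CME
>   i.remove();
> }
> {code}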
> Initially I thought that this could mean data loss, but the logs are already split so it's just the reassignment that doesn't happen (still bad).
> It also left the master with a dead server still marked as being processed, so for two days the balancer didn't run, failing with:
> bq. org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): []
> The reason the array is empty is that we are running 0.90.3, which removes the RS from the dead list if it comes back.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira