Posted to issues@hbase.apache.org by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org> on 2011/05/11 02:13:47 UTC

[jira] [Created] (HBASE-3874) ServerShutdownHandler fails on NPE if a plan has a random region assignment

ServerShutdownHandler fails on NPE if a plan has a random region assignment
---------------------------------------------------------------------------

                 Key: HBASE-3874
                 URL: https://issues.apache.org/jira/browse/HBASE-3874
             Project: HBase
          Issue Type: Bug
    Affects Versions: 0.90.2
            Reporter: Jean-Daniel Cryans
            Priority: Blocker
             Fix For: 0.90.4


By chance, the ulimit on one of our clusters got reverted to 1024 and it started dying non-stop on "Too many open files". The bad part is that some region servers weren't fully processed by the ServerShutdownHandler because it failed on:

{quote}

2011-05-07 00:04:46,203 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN
java.lang.NullPointerException
	at org.apache.hadoop.hbase.master.AssignmentManager.processServerShutdown(AssignmentManager.java:1804)
	at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:101)
	at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)
{quote}

Reading the code, it seems the NPE is in the if statement:

{code}
Map.Entry<String, RegionPlan> e = i.next();
if (e.getValue().getDestination().equals(hsi)) {
  // Use iterator's remove else we'll get CME
  i.remove();
}
{code}
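The "else we'll get CME" comment refers to java.util's fail-fast iterators: mutating the map directly while iterating its entry set throws ConcurrentModificationException, whereas Iterator.remove() is safe. A minimal standalone illustration (toy string map standing in for the plans map; not HBase code):

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class IteratorRemoveDemo {
    public static void main(String[] args) {
        Map<String, String> plans = new HashMap<>();
        plans.put("region-a", "server1");
        plans.put("region-b", "server2");
        plans.put("region-c", "server1");

        // Removing through the iterator is safe; calling plans.remove(key)
        // inside this loop would throw ConcurrentModificationException.
        Iterator<Map.Entry<String, String>> i = plans.entrySet().iterator();
        while (i.hasNext()) {
            Map.Entry<String, String> e = i.next();
            if ("server1".equals(e.getValue())) {
                i.remove();
            }
        }
        System.out.println(plans.size()); // prints 1: only region-b remains
    }
}
```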

This means the destination (HSI) is null. Looking through the code, it seems we instantiate a RegionPlan with a null HSI when it's a random assignment.

So if a random assignment is in flight while a node dies, this issue can happen.
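The patch itself isn't shown in this thread, but the described fix amounts to a null guard before the equals call. A hypothetical sketch under that assumption (RegionPlan reduced to a toy class and hsi to a String; the names here are stand-ins, not the actual HBase types):

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class NullDestinationDemo {
    // Toy stand-in for RegionPlan: destination may be null for a random assignment.
    static class Plan {
        private final String destination;
        Plan(String destination) { this.destination = destination; }
        String getDestination() { return destination; }
    }

    public static void main(String[] args) {
        Map<String, Plan> plans = new HashMap<>();
        plans.put("region-a", new Plan("server1"));
        plans.put("region-b", new Plan(null)); // random assignment: null destination

        String hsi = "server1"; // the dead server being processed
        Iterator<Map.Entry<String, Plan>> i = plans.entrySet().iterator();
        while (i.hasNext()) {
            Map.Entry<String, Plan> e = i.next();
            Plan plan = e.getValue();
            // Guard against a null destination before dereferencing it;
            // the original code called getDestination().equals(hsi) directly
            // and threw an NPE on the random-assignment entry.
            if (plan.getDestination() != null && plan.getDestination().equals(hsi)) {
                i.remove(); // iterator's remove avoids ConcurrentModificationException
            }
        }
        System.out.println(plans.size()); // prints 1: the random plan survives
    }
}
```

With the guard in place, plans whose destination is still undecided are simply skipped instead of crashing the M_SERVER_SHUTDOWN handler.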

Initially I thought that this could mean data loss, but the logs are already split so it's just the reassignment that doesn't happen (still bad).

It also left the master thinking a dead server was still being processed, so for two days the balancer didn't run, logging:

bq. org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): []

And the reason the array is empty is that we are running 0.90.3, which removes the RS from the dead list if it comes back.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-3874) ServerShutdownHandler fails on NPE if a plan has a random region assignment

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035847#comment-13035847 ] 

stack commented on HBASE-3874:
------------------------------

+1 on patch.

I spent a few minutes trying to see if I could construct an AssignmentManager standalone so we could write a unit test, but it's a pain. It takes a ServerManager, etc.


[jira] [Updated] (HBASE-3874) ServerShutdownHandler fails on NPE if a plan has a random region assignment

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-3874:
--------------------------------------

    Attachment: HBASE-3874-trunk.patch

Patch for trunk.


[jira] [Updated] (HBASE-3874) ServerShutdownHandler fails on NPE if a plan has a random region assignment

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-3874:
--------------------------------------

    Attachment: HBASE-3874.patch

Adding a check for a null HSI. I looked into writing a unit test, but it's not obvious how to create a random plan for a region without the actual reassignment taking place.


[jira] [Assigned] (HBASE-3874) ServerShutdownHandler fails on NPE if a plan has a random region assignment

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans reassigned HBASE-3874:
-----------------------------------------

    Assignee: Jean-Daniel Cryans


[jira] [Resolved] (HBASE-3874) ServerShutdownHandler fails on NPE if a plan has a random region assignment

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans resolved HBASE-3874.
---------------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]

Committed to branch and trunk, thanks for the review Stack!


[jira] [Commented] (HBASE-3874) ServerShutdownHandler fails on NPE if a plan has a random region assignment

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037173#comment-13037173 ] 

Hudson commented on HBASE-3874:
-------------------------------

Integrated in HBase-TRUNK #1930 (See [https://builds.apache.org/hudson/job/HBase-TRUNK/1930/])
    
