Posted to issues@hbase.apache.org by "Eugene Koontz (Created) (JIRA)" <ji...@apache.org> on 2012/01/15 19:23:39 UTC

[jira] [Created] (HBASE-5202) NPE in master.AssignmentManager.regionOnline()

NPE in master.AssignmentManager.regionOnline()
----------------------------------------------

                 Key: HBASE-5202
                 URL: https://issues.apache.org/jira/browse/HBASE-5202
             Project: HBase
          Issue Type: Bug
    Affects Versions: 0.90.6
            Reporter: Eugene Koontz
            Assignee: Eugene Koontz


The following NPE can occur during master failover:

{code}
2012-01-15 17:45:00,314 FATAL [Master:1;ip-10-166-123-193.us-west-1.compute.internal:36708] master.HMaster(944): Unhandled exception. Starting shutdown.
java.lang.NullPointerException
        at org.apache.hadoop.hbase.master.AssignmentManager.regionOnline(AssignmentManager.java:724)
        at org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:214)
        at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:396)
        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:279)
        at java.lang.Thread.run(Thread.java:636)
{code}

This is caused by regionOnline() being passed a null serverInfo (its second parameter). 

The AssignmentManager's processFailover() method passes a null to regionOnline() because the value that processFailover() passes, hsi, is set as:

{code}
hsi = this.serverManager.getHServerInfo(this.catalogTracker.getMetaLocation());
{code}

and
 
{code}
hsi = this.serverManager.getHServerInfo(this.catalogTracker.getRootLocation());
{code}

getHServerInfo() is defined as:

{code}
  public HServerInfo getHServerInfo(final HServerAddress hsa) {
    synchronized(this.onlineServers) {
      // TODO: This is primitive.  Do a better search.
      for (Map.Entry<String, HServerInfo> e: this.onlineServers.entrySet()) {
        if (e.getValue().getServerAddress().equals(hsa)) {
          return e.getValue();
        }
      }
    }
    return null;
  }
{code}

This returns null when no entry in the onlineServers map has a server address equal to the one supplied by the catalogTracker's getRootLocation() or getMetaLocation().

Since the catalogTracker uses ZooKeeper to establish the server locations of {{-ROOT-}} and {{.META.}}, while the onlineServers map is populated as these servers register with the master, the catalogTracker and onlineServers can be inconsistent if either of these regionservers is online with respect to ZooKeeper but hasn't yet registered with the master (perhaps due to a high-latency network between the master and the regionserver).
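
The window described above can be illustrated with a small stand-alone sketch. It is not HBase code: the class OnlineServerLookupSketch, its methods, and the example host names are hypothetical stand-ins, with plain strings in place of HServerInfo and HServerAddress. It only assumes the same lookup shape as getHServerInfo(), a scan over registered servers that falls through to null.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the master's server bookkeeping; not the HBase ServerManager.
class OnlineServerLookupSketch {
    // Server name -> "host:port" address, filled in only when a server registers.
    private final Map<String, String> onlineServers = new HashMap<>();

    // Models a regionserver reporting in to the master.
    void register(String serverName, String address) {
        synchronized (onlineServers) {
            onlineServers.put(serverName, address);
        }
    }

    // Same shape as getHServerInfo(): scan the values, return null when nothing matches.
    String findByAddress(String address) {
        synchronized (onlineServers) {
            for (Map.Entry<String, String> e : onlineServers.entrySet()) {
                if (e.getValue().equals(address)) {
                    return e.getKey();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        OnlineServerLookupSketch master = new OnlineServerLookupSketch();
        // Assumed location, as if read from ZooKeeper by a catalog tracker.
        String metaLocation = "rs1.example.com:60020";

        // The server is known to ZooKeeper but has not registered yet, so the
        // lookup finds nothing; dereferencing this result would throw an NPE.
        System.out.println("before registration: " + master.findByAddress(metaLocation));

        master.register("rs1.example.com,60020,1", metaLocation);
        System.out.println("after registration: " + master.findByAddress(metaLocation));
    }
}
```

Any caller that dereferences the lookup result without a null check is exposed for as long as registration lags behind ZooKeeper.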

The attached testMasterFailoverWithSlowRS.txt patch can be used to modify TestMasterFailover to cause this NPE. 

The proposed fix (provided along with the above test in a separate attachment) is for the master to use the new verifyMetaTablesAreUp() to wait for both of the servers named by the catalog tracker's getRootLocation() and getMetaLocation() to register with the master before the master can continue with failover.
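
The general idea of waiting for registration can be sketched as follows. This is not the verifyMetaTablesAreUp() from the attached patch, whose contents are not shown here; the class RegistrationWaitSketch, the timeout and poll parameters, and the string addresses are all assumptions made for illustration.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of "wait until the catalog servers have registered"; not the patch itself.
class RegistrationWaitSketch {
    // Addresses of servers that have registered with the master.
    private final Set<String> registered = ConcurrentHashMap.newKeySet();

    // Models a regionserver reporting in to the master.
    void register(String address) {
        registered.add(address);
    }

    // Polls until every given address has registered, or until timeoutMs elapses.
    // Returns true only if all addresses registered in time.
    boolean waitForRegistration(long timeoutMs, long pollMs, String... addresses) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            boolean allRegistered = true;
            for (String address : addresses) {
                if (!registered.contains(address)) {
                    allRegistered = false;
                    break;
                }
            }
            if (allRegistered) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException ie) {
                // Preserve the interrupt for the caller and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        RegistrationWaitSketch master = new RegistrationWaitSketch();
        // Assumed -ROOT- and .META. locations, as if reported by a catalog tracker.
        String root = "rs1.example.com:60020";
        String meta = "rs2.example.com:60020";

        System.out.println("ready: " + master.waitForRegistration(100, 10, root, meta));
        master.register(root);
        master.register(meta);
        System.out.println("ready: " + master.waitForRegistration(100, 10, root, meta));
    }
}
```

A bounded wait is used in this sketch so that a regionserver that never registers cannot block failover forever; whether the actual patch bounds the wait is not stated in this report.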


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-5202) NPE during Master failover in master.AssignmentManager.regionOnline()

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186601#comment-13186601 ] 

Zhihong Yu commented on HBASE-5202:
-----------------------------------

@Eugene:
I got the following when I tried to apply HBASE-5202.patch:
{code}
2 out of 3 hunks FAILED -- saving rejects to file src/main/java/org/apache/hadoop/hbase/master/HMaster.java.rej
patching file src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
Hunk #1 succeeded at 608 (offset -90 lines).
patching file src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java
Hunk #1 succeeded at 21 with fuzz 1.
Hunk #2 FAILED at 35.
Hunk #3 succeeded at 50 with fuzz 2 (offset -6 lines).
Hunk #4 FAILED at 953.
2 out of 4 hunks FAILED -- saving rejects to file src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java.rej
{code}
Can you provide a new patch?

Normally if a patch is accepted by Hadoop QA, we should only need to rerun the tests reported as failed by Hadoop QA.

Thanks for working over the weekend.
                

        

[jira] [Updated] (HBASE-5202) NPE during Master failover in master.AssignmentManager.regionOnline()

Posted by "Eugene Koontz (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Koontz updated HBASE-5202:
---------------------------------

    Description: 
The following NPE can occur during master failover:

{code}
2012-01-15 17:45:00,314 FATAL [Master:1;ip-10-166-123-193.us-west-1.compute.internal:36708] master.HMaster(944): Unhandled exception. Starting shutdown.
java.lang.NullPointerException
        at org.apache.hadoop.hbase.master.AssignmentManager.regionOnline(AssignmentManager.java:724)
        at org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:214)
        at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:396)
        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:279)
        at java.lang.Thread.run(Thread.java:636)
{code}

This is caused by regionOnline() being passed a null serverInfo (its second parameter). 

The AssignmentManager's processFailover() method passes a null to regionOnline() because the value that processFailover() passes, hsi, is set as:

{code}
hsi = this.serverManager.getHServerInfo(this.catalogTracker.getMetaLocation());
{code}

and
 
{code}
hsi = this.serverManager.getHServerInfo(this.catalogTracker.getRootLocation());
{code}

getHServerInfo() is defined as:

{code}
  public HServerInfo getHServerInfo(final HServerAddress hsa) {
    synchronized(this.onlineServers) {
      // TODO: This is primitive.  Do a better search.
      for (Map.Entry<String, HServerInfo> e: this.onlineServers.entrySet()) {
        if (e.getValue().getServerAddress().equals(hsa)) {
          return e.getValue();
        }
      }
    }
    return null;
  }
{code}

This will return null if the onlineServers map does not yet have a value corresponding to the key supplied by the catalogTracker's getRootLocation() or getMetaLocation(). 

Since the catalogTracker uses ZooKeeper to establish the server locations of {{-ROOT-}} and {{.META.}}, while the onlineServers map is populated as these servers register with the master, the catalogTracker and onlineServers can be inconsistent if either of these regionservers is online with respect to ZooKeeper but hasn't yet registered with the master (perhaps due to a high-latency network between the master and the regionserver).

The attached testMasterFailoverWithSlowRS.txt patch can be used to modify TestMasterFailover to cause this NPE. 

The proposed fix (provided along with the above test in a separate attachment) is for the master to use the new verifyMetaTablesAreUp() to wait for both of the servers named by the catalog tracker's getRootLocation() and getMetaLocation() to register with the master before the master can continue with failover.




    

        

[jira] [Commented] (HBASE-5202) NPE in master.AssignmentManager.regionOnline()

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186559#comment-13186559 ] 

Hadoop QA commented on HBASE-5202:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12510634/HBASE-5202.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    -1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/763//console

This message is automatically generated.
                

        

[jira] [Updated] (HBASE-5202) NPE during Master failover in master.AssignmentManager.regionOnline()

Posted by "Eugene Koontz (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Koontz updated HBASE-5202:
---------------------------------

    Description: 
The following NPE can occur during master failover:

{code}
2012-01-15 17:45:00,314 FATAL [Master:1;ip-10-166-123-193.us-west-1.compute.internal:36708] master.HMaster(944): Unhandled exception. Starting shutdown.
java.lang.NullPointerException
        at org.apache.hadoop.hbase.master.AssignmentManager.regionOnline(AssignmentManager.java:724)
        at org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:214)
        at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:396)
        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:279)
        at java.lang.Thread.run(Thread.java:636)
{code}

This is caused by regionOnline() being passed a null serverInfo (its second parameter). 

The AssignmentManager's processFailover() method passes a null to regionOnline() because the value that processFailover() passes, hsi, is set as:

{code}
hsi = this.serverManager.getHServerInfo(this.catalogTracker.getMetaLocation());
{code}

and
 
{code}
hsi = this.serverManager.getHServerInfo(this.catalogTracker.getRootLocation());
{code}

getHServerInfo() is defined as:

{code}
  public HServerInfo getHServerInfo(final HServerAddress hsa) {
    synchronized(this.onlineServers) {
      // TODO: This is primitive.  Do a better search.
      for (Map.Entry<String, HServerInfo> e: this.onlineServers.entrySet()) {
        if (e.getValue().getServerAddress().equals(hsa)) {
          return e.getValue();
        }
      }
    }
    return null;
  }
{code}

This will return null if the onlineServers map does not yet have a value corresponding to the key supplied by the catalogTracker's getRootLocation() or getMetaLocation(). 

Since the catalogTracker uses ZooKeeper to establish the server locations of {{-ROOT-}} and {{.META.}}, while the onlineServers map is populated as these servers register with the master, the catalogTracker and onlineServers can be inconsistent if either of these regionservers is online with respect to ZooKeeper but hasn't yet registered with the master (perhaps due to a high-latency network between the master and the regionserver).

The attached testMasterFailoverWithSlowRS.txt patch can be used to modify TestMasterFailover to cause this NPE. 

The proposed fix (provided along with the above test in a separate attachment) is for the master to use the new verifyMetaTablesAreUp() to wait for both of the servers named by the catalog tracker's getRootLocation() and getMetaLocation() to register with the master before the master can continue with failover.




    

        

[jira] [Updated] (HBASE-5202) NPE in master.AssignmentManager.regionOnline()

Posted by "Eugene Koontz (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Koontz updated HBASE-5202:
---------------------------------

    Attachment: testMasterFailoverWithSlowRS.txt

patch to TestMasterFailover.java to cause NPE.
                
> NPE in master.AssignmentManager.regionOnline()
> ----------------------------------------------
>
>                 Key: HBASE-5202
>                 URL: https://issues.apache.org/jira/browse/HBASE-5202
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.6
>            Reporter: Eugene Koontz
>            Assignee: Eugene Koontz
>         Attachments: testMasterFailoverWithSlowRS.txt
>
>
> The following NPE can occur during master failover:
> {code}
> 2012-01-15 17:45:00,314 FATAL [Master:1;ip-10-166-123-193.us-west-1.compute.internal:36708] master.HMaster(944): Unhandled exception. Starting shutdown.
> java.lang.NullPointerException
>         at org.apache.hadoop.hbase.master.AssignmentManager.regionOnline(AssignmentManager.java:724)
>         at org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:214)
>         at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:396)
>         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:279)
>         at java.lang.Thread.run(Thread.java:636)
> {code}
> This is caused by regionOnline() being passed a null serverInfo (its second parameter).
> The AssignmentManager's processFailover() method passes a null to regionOnline() because the value that processFailover() passes, hsi, is set as:
> {code}
> hsi = this.serverManager.getHServerInfo(this.catalogTracker.getMetaLocation());
> {code}
> and
>  
> {code}
> hsi = this.serverManager.getHServerInfo(this.catalogTracker.getRootLocation());
> {code}
> getHServerInfo() is defined as:
> {code}
>   public HServerInfo getHServerInfo(final HServerAddress hsa) {
>     synchronized(this.onlineServers) {
>       // TODO: This is primitive.  Do a better search.
>       for (Map.Entry<String, HServerInfo> e: this.onlineServers.entrySet()) {
>         if (e.getValue().getServerAddress().equals(hsa)) {
>           return e.getValue();
>         }
>       }
>     }
>     return null;
>   }
> {code}
> This will return null if the onlineServers map does not yet have a value corresponding to the key supplied by the catalogTracker's getRootLocation() or getMetaLocation().
> Since the catalogTracker uses ZooKeeper to establish the server locations of {{-ROOT-}} and {{.META.}}, while the onlineServers map is populated as these servers register with the master, the catalogTracker and onlineServers can be inconsistent if either of these regionservers is online with respect to ZooKeeper but hasn't yet registered with the master (perhaps due to a high-latency network between the master and the regionserver).
> The attached testMasterFailoverWithSlowRS.txt patch can be used to modify TestMasterFailover to cause this NPE. 
> The proposed fix (provided along with the above test in a separate attachment) is for the master to use the new verifyMetaTablesAreUp() to wait for both of the servers named by the catalog tracker's getRootLocation() and getMetaLocation() to register with the master before the master can continue with failover.
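
The wait described above can be sketched roughly as follows. This is a minimal illustration, not the attached patch: the body of verifyMetaTablesAreUp() is not shown in this thread, so the class CatalogServerWait, the helper waitForRegistration(), and the use of plain strings in place of HServerInfo and HServerAddress are all assumptions made for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in types: strings replace HServerInfo/HServerAddress.
final class CatalogServerWait {
  // Stand-in for ServerManager.onlineServers: server name -> address ("host:port").
  static final Map<String, String> onlineServers = new ConcurrentHashMap<>();

  // Same shape as the linear scan in getHServerInfo():
  // returns null when no registered server has this address.
  static String findServerByAddress(String address) {
    for (Map.Entry<String, String> e : onlineServers.entrySet()) {
      if (e.getValue().equals(address)) {
        return e.getKey();
      }
    }
    return null;
  }

  // Hypothetical helper: poll until every address has registered,
  // or give up once timeoutMs has elapsed.
  static boolean waitForRegistration(List<String> addresses, long timeoutMs, long pollMs) {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (true) {
      boolean allUp = true;
      for (String address : addresses) {
        if (findServerByAddress(address) == null) {
          allUp = false;
          break;
        }
      }
      if (allUp) {
        return true;
      }
      if (System.currentTimeMillis() >= deadline) {
        return false;
      }
      try {
        Thread.sleep(pollMs);
      } catch (InterruptedException ie) {
        // Preserve the interrupt for the caller and stop waiting.
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }
}
```

In this sketch the caller would pass the addresses reported for {{-ROOT-}} and {{.META.}} and only proceed with failover when the method returns true, so that the lookup never hands a null to the code that follows.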


        

[jira] [Updated] (HBASE-5202) NPE during Master failover in master.AssignmentManager.regionOnline()

Posted by "Eugene Koontz (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Koontz updated HBASE-5202:
---------------------------------

    Summary: NPE during Master failover in master.AssignmentManager.regionOnline()  (was: NPE in master.AssignmentManager.regionOnline())
    

        

[jira] [Updated] (HBASE-5202) NPE in master.AssignmentManager.regionOnline()

Posted by "Eugene Koontz (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Koontz updated HBASE-5202:
---------------------------------

    Attachment: HBASE-5202.patch
    

        

[jira] [Commented] (HBASE-5202) NPE during Master failover in master.AssignmentManager.regionOnline()

Posted by "Eugene Koontz (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186604#comment-13186604 ] 

Eugene Koontz commented on HBASE-5202:
--------------------------------------

Hi Zhihong, my patch is based on apache's 0.90 branch:

{code}
commit 508659a260579e4fecc3d45d3b112c7fd19d5a1f
Author: Ramkrishna S. Vasudevan <ra...@apache.org>
Date:   Sat Jan 14 18:44:33 2012 +0000

    HBASE-5155 ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation 
    
    
    git-svn-id: https://svn.apache.org/repos/asf/hbase/branches/0.90@1231555 13f79535-47bb-0310-9956-ffa45
{code}

on my github: https://github.com/ekoontz/hbase/commits/HBASE-5202
                

        

[jira] [Commented] (HBASE-5202) NPE during Master failover in master.AssignmentManager.regionOnline()

Posted by "gaojinchao (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187421#comment-13187421 ] 

gaojinchao commented on HBASE-5202:
-----------------------------------

Take a look at https://issues.apache.org/jira/browse/HBASE-5179. It may resolve this issue.
                

        

[jira] [Updated] (HBASE-5202) NPE in master.AssignmentManager.regionOnline()

Posted by "Eugene Koontz (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Koontz updated HBASE-5202:
---------------------------------

    Status: Patch Available  (was: Open)
    

        

[jira] [Commented] (HBASE-5202) NPE during Master failover in master.AssignmentManager.regionOnline()

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186605#comment-13186605 ] 

Zhihong Yu commented on HBASE-5202:
-----------------------------------

The attached patch is for the 0.90 branch.

@Eugene:
Can you prepare a patch for TRUNK?
In TRUNK, this.regionServerTracker.getOnlineServers() returns List<ServerName>.
ServerManager.onlineServers is keyed by ServerName, so the lookup would be much faster.

For 0.90, I suggest passing List<HServerAddress> to serverManager.isServerOnline() so that we can iterate through the entries of ServerManager.onlineServers and find out whether both the {{-ROOT-}} and {{.META.}} servers are online.
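
The two lookups contrasted above can be sketched as follows. This is an illustration under stated assumptions, not code from either branch: OnlineCheck, allAddressesOnline() and allKeysOnline() are hypothetical names, and strings stand in for HServerAddress, HServerInfo and ServerName.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in types: strings replace the HBase server classes.
final class OnlineCheck {
  // 0.90-style: the map is keyed by server name, so matching by address
  // needs a scan. One pass over the values checks all wanted addresses.
  static boolean allAddressesOnline(Map<String, String> onlineServers, List<String> wanted) {
    Set<String> remaining = new HashSet<>(wanted);
    for (String address : onlineServers.values()) {
      remaining.remove(address);
      if (remaining.isEmpty()) {
        return true;
      }
    }
    return remaining.isEmpty();
  }

  // TRUNK-style: the map is keyed by the same identity being looked up,
  // so each check is a hash lookup instead of a scan.
  static boolean allKeysOnline(Map<String, ?> onlineServers, Collection<String> wanted) {
    return onlineServers.keySet().containsAll(wanted);
  }
}
```

Passing the whole list at once keeps the 0.90 check to a single iteration over the map, rather than one full scan per catalog server.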
                

        

[jira] [Commented] (HBASE-5202) NPE during Master failover in master.AssignmentManager.regionOnline()

Posted by "Eugene Koontz (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186609#comment-13186609 ] 

Eugene Koontz commented on HBASE-5202:
--------------------------------------

Jinchao wrote (in [HBASE-3933|https://issues.apache.org/jira/browse/HBASE-3933?focusedCommentId=13186098&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13186098]):

{quote}
@Eugene
In your patch, you only deal with the root/meta regionserver. If a normal regionserver registers late,
the master will process it as a dead one. Some regions on the late server will be opened twice.
{quote}

Jinchao, can you explain this scenario more? Does my patch cause duplicate openings that could not happen before? Or are you saying that this patch does not fix the existing NPE described on HBASE-3933?
                

        

[jira] [Commented] (HBASE-5202) NPE during Master failover in master.AssignmentManager.regionOnline()

Posted by "gaojinchao (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186681#comment-13186681 ] 

gaojinchao commented on HBASE-5202:
-----------------------------------

The root cause of this issue is that some region servers register late.

When a region server that does not host META/ROOT registers after "rebuildUserRegions" has finished, the regions on that server will be opened twice.

                
> NPE during Master failover in master.AssignmentManager.regionOnline()
> ---------------------------------------------------------------------
>
>                 Key: HBASE-5202
>                 URL: https://issues.apache.org/jira/browse/HBASE-5202
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.6
>            Reporter: Eugene Koontz
>            Assignee: Eugene Koontz
>         Attachments: HBASE-5202.patch, testMasterFailoverWithSlowRS.txt
>
>
> The following NPE can occur during master failover:
> {code}
> 2012-01-15 17:45:00,314 FATAL [Master:1;ip-10-166-123-193.us-west-1.compute.internal:36708] master.HMaster(944): Unhandled exception. Starting shutdown.
> java.lang.NullPointerException
>         at org.apache.hadoop.hbase.master.AssignmentManager.regionOnline(AssignmentManager.java:724)
>         at org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:214)
>         at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:396)
>         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:279)
>         at java.lang.Thread.run(Thread.java:636)
> {code}
> This is caused by regionOnline() being passed a null serverInfo (its second parameter). 
> The AssignmentManager's processFailover() method passes a null to regionOnline() because the value it passes, hsi, is set as:
> {code}
> hsi = this.serverManager.getHServerInfo(this.catalogTracker.getMetaLocation());
> {code}
> and
>  
> {code}
> hsi = this.serverManager.getHServerInfo(this.catalogTracker.getRootLocation());
> {code}
> getHServerInfo() is defined as:
> {code}
>   public HServerInfo getHServerInfo(final HServerAddress hsa) {
>     synchronized(this.onlineServers) {
>       // TODO: This is primitive.  Do a better search.
>       for (Map.Entry<String, HServerInfo> e: this.onlineServers.entrySet()) {
>         if (e.getValue().getServerAddress().equals(hsa)) {
>           return e.getValue();
>         }
>       }
>     }
>     return null;
>   }
> {code}
> This will return null if the onlineServers map does not yet have a value corresponding to the key supplied by the catalogTracker's getRootLocation() or getMetaLocation(). 
> Since the catalogTracker uses ZooKeeper to establish the server locations of {{-ROOT-}} and {{.META.}}, while the onlineServers map is populated as these servers register with the master, the catalogTracker and onlineServers can be inconsistent if either of these regionservers is online with respect to ZooKeeper but hasn't yet registered with the master (perhaps due to a high-latency network between the master and the regionserver).
> The attached testMasterFailoverWithSlowRS.txt patch can be used to modify TestMasterFailover to cause this NPE. 
> The proposed fix (provided along with the above test in a separate attachment) is for the master to use the new verifyMetaTablesAreUp() to wait for both of the servers named by the catalog tracker's getRootLocation() and getMetaLocation() to register with the master before the master can continue with failover.


        

[jira] [Commented] (HBASE-5202) NPE during Master failover in master.AssignmentManager.regionOnline()

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186606#comment-13186606 ] 

Zhihong Yu commented on HBASE-5202:
-----------------------------------

Also, I think we should impose a timeout for verifyMetaTablesAreUp() so that we don't wait indefinitely.
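
A minimal sketch of what such a bounded wait could look like, written against plain JDK types rather than the actual HBase classes. The class name BoundedRegistrationWait, the method waitForNonNull(), and its parameters are hypothetical and are not taken from the attached patch; the lookup argument stands in for a call such as getHServerInfo() that returns null until the region server has registered. It uses Java 8 types for brevity.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper, not part of HBase: polls a lookup that may return null
// and gives up once a deadline passes instead of waiting indefinitely.
class BoundedRegistrationWait {

  /**
   * Calls lookup repeatedly until it returns a non-null value or timeoutMs
   * has elapsed. Returns Optional.empty() on timeout or interruption, so the
   * caller has to handle the "never registered" case explicitly.
   */
  static <T> Optional<T> waitForNonNull(Supplier<T> lookup, long timeoutMs, long pollMs) {
    final long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (true) {
      T value = lookup.get();
      if (value != null) {
        return Optional.of(value);
      }
      long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
      if (remainingMs <= 0) {
        return Optional.empty(); // timed out
      }
      try {
        // Never sleep past the deadline.
        Thread.sleep(Math.min(pollMs, remainingMs));
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt status
        return Optional.empty();
      }
    }
  }
}
```

With something along these lines, the failover path could treat an empty result as a reason to abort or retry, rather than passing a null server info on to regionOnline().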
                


        

[jira] [Commented] (HBASE-5202) NPE during Master failover in master.AssignmentManager.regionOnline()

Posted by "Eugene Koontz (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186596#comment-13186596 ] 

Eugene Koontz commented on HBASE-5202:
--------------------------------------

{{mvn clean test}} returns:

{code}
Results :

Tests run: 704, Failures: 0, Errors: 0, Skipped: 9

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1:23:11.247s

{code}


I ran {{mvn test -Dtest=TestMasterFailover}} 100+ times with no failures.
                
