Posted to issues@hbase.apache.org by "James Kennedy (JIRA)" <ji...@apache.org> on 2011/01/26 01:36:43 UTC

[jira] Created: (HBASE-3478) HBase fails to recover from failed DNS resolution of stale meta connection info

HBase fails to recover from failed DNS resolution of stale meta connection info
-------------------------------------------------------------------------------

                 Key: HBASE-3478
                 URL: https://issues.apache.org/jira/browse/HBASE-3478
             Project: HBase
          Issue Type: Bug
          Components: master
    Affects Versions: 0.90.1
            Reporter: James Kennedy
             Fix For: 0.90.1


This looks like a variant of HBASE-3445:

One of our developers ran a seed program with configuration A to generate some test data on his local machine. He then moved that data into a development environment on the same machine with a different hbase configuration B.

On startup the HMaster waits for the new regionserver to register itself:

[25/01/11 15:37:25] 162161 [  HRegionServer] INFO  ase.regionserver.HRegionServer  - Telling master at 10.0.1.4:7801 that we are up
[25/01/11 15:37:25] 162165 [ice-EventThread] DEBUG .hadoop.hbase.zookeeper.ZKUtil  - master:7801-0x12dbf879abe0000 Retrieved 13 byte(s) of data from znode /hbase/rs/10.0.1.4,7802,1295998613814 and set watcher; 10.0.1.4:7802

Then the ROOT region comes online at the right place: 10.0.1.4,7802

[25/01/11 15:37:31] 168369 [yTasks:70236052] INFO  ase.catalog.RootLocationEditor  - Setting ROOT region location in ZooKeeper as 10.0.1.4:7802
[25/01/11 15:37:31] 168408 [10.0.1.4:7801-0] DEBUG er.handler.OpenedRegionHandler  - Opened region -ROOT-,,0.70236052 on 10.0.1.4,7802,1295998613814

But then HMaster chokes on the stale META region location.

[25/01/11 15:37:31] 168448 [        HMaster] ERROR he.hadoop.hbase.HServerAddress  - Could not resolve the DNS name of warren:60020
[25/01/11 15:37:31] 168448 [        HMaster] FATAL he.hadoop.hbase.master.HMaster  - Unhandled exception. Starting shutdown.
java.lang.IllegalArgumentException: Could not resolve the DNS name of warren:60020
   at org.apache.hadoop.hbase.HServerAddress.checkBindAddressCanBeResolved(HServerAddress.java:105)
   at org.apache.hadoop.hbase.HServerAddress.<init>(HServerAddress.java:66)
   at org.apache.hadoop.hbase.catalog.MetaReader.readLocation(MetaReader.java:344)
   at org.apache.hadoop.hbase.catalog.MetaReader.readMetaLocation(MetaReader.java:281)
   at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:280)
   at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyMetaRegionLocation(CatalogTracker.java:482)
   at org.apache.hadoop.hbase.master.HMaster.assignRootAndMeta(HMaster.java:435)
   at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:382)
   at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:277)
   at java.lang.Thread.run(Thread.java:680)

First of all, we do not yet understand why under configuration A the RegionInfo resolved to "warren:60020" whereas under configuration B we get "10.0.1.4:7802". The port numbers make sense, but the "warren" hostname does not. It is probably something specific to Warren's Mac environment, since no other developer hits this problem when doing the same thing. "warren" isn't in his hosts file, so that remains a mystery.

But irrespective of that, since the ports differ we would expect the stale meta connection data to cause a connection failure anyway, perhaps in the form of a SocketTimeoutException as in HBASE-3445.

But shouldn't the HMaster handle that by catching the exception and letting verifyMetaRegionLocation() fail so that meta regions get reassigned to the new region server?

Probably the safeguards in CatalogTracker.getCachedConnection() should move up into getMetaServerConnection() so that they also cover MetaReader.readMetaLocation(). Essentially, if getMetaServerConnection() encounters ANY exception while connecting to the meta RegionServer, it should probably just return null to force meta region reassignment.
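
To illustrate the idea, the guard could look something like the following standalone sketch. All names here are hypothetical: InetAddress.getByName() stands in for the HServerAddress resolution that threw above, and readMetaLocation()/getMetaServerConnection() are simplified stand-ins for the real MetaReader/CatalogTracker code. This shows the pattern, not an actual patch:

    import java.net.InetAddress;

    // Standalone sketch of the proposed guard (hypothetical names, not
    // the actual CatalogTracker code).
    public class MetaLocationGuard {

        // Stand-in for MetaReader.readMetaLocation(): throws when the
        // recorded location ("warren:60020") cannot be resolved.
        static InetAddress readMetaLocation(String host) throws Exception {
            return InetAddress.getByName(host);
        }

        // Guarded getMetaServerConnection(): null means "no usable meta
        // location, reassign the region" rather than "kill the master".
        static InetAddress getMetaServerConnection(String recordedHost) {
            try {
                return readMetaLocation(recordedHost);
            } catch (Exception e) {
                System.err.println("Stale meta location " + recordedHost + ": " + e);
                return null;
            }
        }

        public static void main(String[] args) {
            // With an unresolvable host this prints null (forcing
            // reassignment) instead of dying on an unhandled
            // IllegalArgumentException.
            System.out.println(getMetaServerConnection("warren"));
        }
    }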





[jira] Commented: (HBASE-3478) HBase fails to recover from failed DNS resolution of stale meta connection info

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992067#comment-12992067 ] 

stack commented on HBASE-3478:
------------------------------

This looks like something HBASE-3446 would fix. What do you think, James? Rather than return the raw HRegionInterface up out of readMetaLocation, we'll return an HTable.
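
Roughly, the shape of that change would be something like this sketch (hypothetical stand-in types; the real 0.90 signatures differ):

    // Sketch with hypothetical stand-in types, not the real 0.90 classes.
    public class ReadMetaLocationShape {

        // Stand-in for the raw HRegionInterface handed out today: a single
        // failure propagates straight up to the caller.
        interface RawRegion {
            byte[] get(byte[] row) throws Exception;
        }

        // Stand-in for HTable: owns the reconnect/retry policy internally,
        // so callers never see transient failures. Retry loop elided here;
        // the point is the return type.
        static class TableClient {
            private final RawRegion raw;
            TableClient(RawRegion raw) { this.raw = raw; }
            byte[] get(byte[] row) throws Exception {
                return raw.get(row); // the real HTable retries around this call
            }
        }

        // Before: the raw interface comes back out of readMetaLocation.
        // After: hand back the retrying client instead.
        static TableClient readMetaLocation(RawRegion raw) {
            return new TableClient(raw);
        }
    }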


[jira] [Resolved] (HBASE-3478) HBase fails to recover from failed DNS resolution of stale meta connection info

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-3478.
--------------------------

    Resolution: Fixed


[jira] Commented: (HBASE-3478) HBase fails to recover from failed DNS resolution of stale meta connection info

Posted by "James Kennedy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992089#comment-12992089 ] 

James Kennedy commented on HBASE-3478:
--------------------------------------

I'm not 100% convinced. Even if readRegionLocation is well-proofed and retries as hard as it can, it is still only one of several possible exception throwers within getMetaServerConnection(). HBASE-3445 is an example of that, where we had to plug a hole in the exception handling of getCachedConnection().

I think the greater point I was trying to make above is that robust exception handling needs to move up into getMetaServerConnection(): catch and log ALL exceptions there, and then separately handle the exceptional exceptions, if any, that we actually DO want to terminate the HMaster fatally when it can't connect to a meta server.
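
As a self-contained sketch of that pattern (hypothetical names; java.net.Socket stands in for the region server RPC connection):

    import java.net.InetSocketAddress;
    import java.net.Socket;

    // Illustrative only: Socket stands in for the real RPC connection.
    public class MetaConnectGuard {

        static Socket connectToMeta(String host, int port) {
            try {
                Socket s = new Socket();
                s.connect(new InetSocketAddress(host, port), 1000); // 1s timeout
                return s;
            } catch (Exception e) {
                // Catch and log EVERYTHING at this level. Returning null
                // tells the caller to reassign meta rather than abort
                // the master.
                System.err.println("Cannot reach recorded meta location "
                        + host + ":" + port + ": " + e);
                return null;
            }
            // Anything judged genuinely fatal (an Error, say) is
            // deliberately not caught here and still takes the process down.
        }

        public static void main(String[] args) {
            // An unresolvable host and a stale port both end up as null.
            System.out.println(connectToMeta("warren", 60020));
        }
    }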


[jira] [Commented] (HBASE-3478) HBase fails to recover from failed DNS resolution of stale meta connection info

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051396#comment-13051396 ] 

stack commented on HBASE-3478:
------------------------------

I meant to say: resolving as fixed by HBASE-3446.


[jira] Commented: (HBASE-3478) HBase fails to recover from failed DNS resolution of stale meta connection info

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992099#comment-12992099 ] 

stack commented on HBASE-3478:
------------------------------

Your fix-up in HBASE-3445 was needed because the new CatalogTracker class was working with the low-level HRegionInterface and, for want of usage, had not yet tripped all possible exceptions. HBASE-3446 was about putting in place the wizened HTable; it's seen it all, so it catches and retries the litany of possible exceptions. I think HBASE-3446 addresses your greater point about more robust exception handling.
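
For illustration, the retrying behavior amounts to a loop like the following standalone sketch; the attempt count and pause here are made up, not HBase defaults:

    import java.util.concurrent.Callable;

    // Standalone sketch of the client-side retry loop that an
    // HTable-style client wraps around each call. Illustrative only.
    public class RetryDemo {

        static <T> T withRetries(Callable<T> call, int attempts, long pauseMs)
                throws Exception {
            Exception last = null;
            for (int i = 1; i <= attempts; i++) {
                try {
                    return call.call();
                } catch (Exception e) {
                    last = e;               // swallow, pause, and try again
                    Thread.sleep(pauseMs);
                }
            }
            throw last;                     // surfaces only after every retry fails
        }

        public static void main(String[] args) throws Exception {
            final int[] failuresLeft = {2}; // fail twice, then succeed
            String result = withRetries(() -> {
                if (failuresLeft[0]-- > 0) throw new java.io.IOException("transient");
                return "connected";
            }, 5, 10);
            System.out.println(result);     // "connected", on the third attempt
        }
    }

Only after every attempt fails does the caller see an exception, which is why routing meta access through HTable hides most of the transient failure modes the raw interface exposes.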

Let me link this issue to that one and see if I can come up with a unit test that repros your exception above.

Thanks James.


[jira] [Commented] (HBASE-3478) HBase fails to recover from failed DNS resolution of stale meta connection info

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051196#comment-13051196 ] 

Todd Lipcon commented on HBASE-3478:
------------------------------------

This got fixed, right?


[jira] Commented: (HBASE-3478) HBase fails to recover from failed DNS resolution of stale meta connection info

Posted by "James Kennedy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986792#action_12986792 ] 

James Kennedy commented on HBASE-3478:
--------------------------------------

Configuration A:

<property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:8701/hbase</value>
</property>
<property>
    <name>hbase.master.port</name>
    <value>60010</value>
</property>
<property>
    <name>hbase.regionserver.port</name>
    <value>60020</value>
</property>
<property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>60030</value>
</property>
<property>
    <name>hbase.regionserver.msginterval</name>
    <value>100</value>
    <description>Interval between messages from the RegionServer to HMaster
        in milliseconds. Default is 15 sec. Set this value low if you want
        unit tests to be responsive.
    </description>
</property>
<property>
    <name>hbase.client.pause</name>
    <value>100</value>
</property>

Configuration B:
<property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:7701/hbase</value>
</property>
<property>
    <name>hbase.master.port</name>
    <value>7801</value>
</property>
<property>
    <name>hbase.regionserver.port</name>
    <value>7802</value>
</property>



[jira] [Commented] (HBASE-3478) HBase fails to recover from failed DNS resolution of stale meta connection info

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016198#comment-13016198 ] 

Jean-Daniel Cryans commented on HBASE-3478:
-------------------------------------------

This situation happened to someone on the mailing list.
