Posted to issues@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2010/09/17 10:08:32 UTC
[jira] Created: (HBASE-3010) Can't start/stop/start... cluster using new master
Can't start/stop/start... cluster using new master
--------------------------------------------------
Key: HBASE-3010
URL: https://issues.apache.org/jira/browse/HBASE-3010
Project: HBase
Issue Type: Bug
Components: master
Reporter: stack
Priority: Blocker
Fix For: 0.90.0
Currently you can start a small cluster the first time on TRUNK (i.e. with the new master), but the second time you start it up you run into a couple of interesting issues:
+ The old root-region-location is still in place. It gets cleaned later but for a while on startup it does not have the 'right' address.
+ A regionserver (or a client) creates a catalogtracker on startup, a class that watches for changes in the meta tables and keeps catalog table locations up to date. Starting the catalogtracker results in a check for the current catalog locations. As part of this process, since root-region-location "exists", the catalogtracker tries to verify root's location by doing a noop against the root host, but to do this it needs to do the initial rpc proxy setup. It can happen that the old root address was that of the current regionserver trying to initialize, so we'll be trying to connect to ourselves to verify root's location. The catch is that we're doing this before we've set up the rpcserver and handlers, so we block, and as it happens there is no timeout on proxy setup (Todd ran into this yesterday, I ran into it today; it's easy to manufacture).
+ So the regionserver can't progress. Meantime the master can't progress because there are no regionservers checking in. And you can't shut it down because we're not looking at the right 'stop' flag.
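The hang described above comes down to a blocking call with no timeout. As a rough illustration only (this is not HBase code; the class BoundedCall and the method callWithTimeout are hypothetical names), a blocking step such as proxy setup could be wrapped so that the caller gets an empty result back after a bounded wait instead of blocking forever:

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper, not part of HBase: runs a blocking call with an upper bound on the wait. */
public class BoundedCall {
  /**
   * Runs the call on a daemon thread and waits at most timeoutMs for its result.
   * Returns Optional.empty() if the call times out, fails, or returns null.
   */
  public static <T> Optional<T> callWithTimeout(Callable<T> call, long timeoutMs) {
    ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
      Thread t = new Thread(r, "bounded-call");
      t.setDaemon(true); // a stuck call must never keep the JVM alive
      return t;
    });
    Future<T> future = pool.submit(call);
    try {
      return Optional.ofNullable(future.get(timeoutMs, TimeUnit.MILLISECONDS));
    } catch (TimeoutException e) {
      future.cancel(true); // interrupt the worker; the call may or may not honor it
      return Optional.empty();
    } catch (ExecutionException e) {
      return Optional.empty(); // the call itself threw
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the caller's interrupt status
      return Optional.empty();
    } finally {
      pool.shutdownNow();
    }
  }
}
```

A caller would treat an empty result as "location not verified" and carry on with startup. Note that cancelling only interrupts the worker thread, so a call that ignores interrupts would still leak a daemon thread.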
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Resolved: (HBASE-3010) Can't start/stop/start... cluster using new master
Posted by "stack (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
stack resolved HBASE-3010.
--------------------------
Hadoop Flags: [Reviewed]
Assignee: stack
Resolution: Fixed
Thanks for review Todd. Committed.
[jira] Commented: (HBASE-3010) Can't start/stop/start... cluster using new master
Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910884#action_12910884 ]
HBase Review Board commented on HBASE-3010:
-------------------------------------------
Message from: stack@duboce.net
bq. On 2010-09-17 16:25:15, Todd Lipcon wrote:
bq. > src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java, line 142
bq. > <http://review.cloudera.org/r/873/diff/1/?file=11929#file11929line142>
bq. >
bq. > hrm, I guess that's a good idea, but something seems a little strange about this :)
Yeah, this is a little 'bold', but thinking it through I couldn't see an issue with it, whereas not doing it is going to frustrate, as a restart will have this minute-or-so stall while we wait on the znode to expire. I'd say it's good for now, and I suppose we'll see later if it becomes a problem.
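To make the idea under discussion concrete, here is a minimal sketch of "delete the master znode only if it holds our own address". It is not the ActiveMasterManager code: a plain Map stands in for ZooKeeper, and the class name, method name, znode path and addresses are all assumptions for illustration.

```java
import java.util.Map;

/** Hypothetical sketch, not HBase code: a Map stands in for ZooKeeper's znode tree. */
public class StaleMasterZNode {
  /**
   * Removes the master znode only when its data equals our own address,
   * i.e. when it is a leftover from our previous run. Returns true if it was removed.
   */
  public static boolean deleteIfOurs(Map<String, String> znodes,
                                     String masterPath,
                                     String ourAddress) {
    // Map.remove(key, value) removes the entry only if it is currently mapped to value,
    // so a znode written by a different master is left alone.
    return znodes.remove(masterPath, ourAddress);
  }
}
```

For example, with a map holding "/hbase/master" -> "host1:60000", calling deleteIfOurs for "host2:60000" leaves the entry in place, while calling it for "host1:60000" removes it. Against a real ZooKeeper ensemble the compare and the delete are separate operations, so some form of version check would be needed to avoid removing a znode that another master created in between.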
bq. On 2010-09-17 16:25:15, Todd Lipcon wrote:
bq. > src/main/java/org/apache/hadoop/hbase/master/HMaster.java, line 222
bq. > <http://review.cloudera.org/r/873/diff/1/?file=11931#file11931line222>
bq. >
bq. > this should probably move down until after we're the active master
sure... will do on commit.
- stack
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/873/#review1267
-----------------------------------------------------------
[jira] Commented: (HBASE-3010) Can't start/stop/start... cluster using new master
Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910855#action_12910855 ]
HBase Review Board commented on HBASE-3010:
-------------------------------------------
Message from: "Todd Lipcon" <to...@cloudera.com>
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/873/#review1267
-----------------------------------------------------------
Ship it!
src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java
<http://review.cloudera.org/r/873/#comment4312>
hrm, I guess that's a good idea, but something seems a little strange about this :)
src/main/java/org/apache/hadoop/hbase/master/HMaster.java
<http://review.cloudera.org/r/873/#comment4313>
this should probably move down until after we're the active master
- Todd
[jira] Commented: (HBASE-3010) Can't start/stop/start... cluster using new master
Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910509#action_12910509 ]
HBase Review Board commented on HBASE-3010:
-------------------------------------------
Message from: stack@duboce.net
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/873/
-----------------------------------------------------------
Review request for hbase.
Summary
-------
The patch changes catalogtracker to not connect to root on start. Instead, it waits on a kick from zk before going after root or meta. This change doesn't address the case where an RS can get stuck on itself trying to connect to an RPC server that is not yet running. Rather, it sidesteps it (we should come back and do something about the missing timeout when setting up the proxy, since the hang is still possible; I'll file an issue on it). The patch includes other fixups, not all cosmetic.
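The "wait on a kick" behaviour can be sketched roughly as follows. This is an illustration under assumptions, not the CatalogTracker from the patch: LazyLocationTracker, onLocationChanged and waitForLocation are hypothetical names, and the notification source (a zk watcher in the real system) is left to the caller.

```java
/** Hypothetical sketch, not HBase code: never contacts a server on start, only reacts to notifications. */
public class LazyLocationTracker {
  private String location; // stays null until a notification arrives

  /** Called by whatever delivers location changes (assumed here to be a watcher callback). */
  public synchronized void onLocationChanged(String newLocation) {
    location = newLocation;
    notifyAll(); // wake up anyone blocked in waitForLocation
  }

  /**
   * Waits up to timeoutMs for a location to be published.
   * Returns null on timeout or if the waiting thread is interrupted.
   */
  public synchronized String waitForLocation(long timeoutMs) {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (location == null) {
      long remaining = deadline - System.currentTimeMillis();
      if (remaining <= 0) {
        return null; // timed out without ever trying to connect anywhere
      }
      try {
        wait(remaining); // remaining > 0 here, so this is always a bounded wait
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return null;
      }
    }
    return location;
  }
}
```

The point of the design is that construction and startup do no I/O at all, so a stale address left over from a previous run cannot cause a connect attempt before a fresh notification has arrived.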
M src/test/java/org/apache/hadoop/hbase/master/TestActiveMasterManager.java
Add a test for the case where the master comes up and the master
znode up in zk already contains our address
D src/test/java/org/apache/hadoop/hbase/master/TestMinimumServerCount.java
Removed test of something we no longer do: wait on an explicit number
of regionservers to come in before we'll go ahead w/ master startup.
M src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
Swapped order in which we do some of the startup (Cosmetic)
M src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java
Javadoc
M src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
Removed minimumServerCount. Seems bad predicating master startup
on N RS's coming in.
Renamed method numServers as countOfRegionServers and made it protected.
Removed other unused methods.
Redid waitForMinServers as waitForRegionServers... where we just
hang around until count of regionservers stabilizes. TODO: improve
M src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java
Handle case where the current master znode has our address; in this
case we can hurry up the expiration by deleting the znode.
M src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
Minor formatting
M src/main/java/org/apache/hadoop/hbase/master/HMaster.java
Renamed clusterStarter as freshClusterStartup. Predicate this boolean
off the count of regionservers. If 0, then fresh cluster start. Else
do special handling (TODO).
Edit on HMaster constructor comments.
Moved some code out of the HMaster constructor into a stallIfBackupMaster method.
If aborting set stop flag.
M src/main/java/org/apache/hadoop/hbase/master/HMasterCommandLine.java
Removed unused imports.
M src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
Make catalogtracker lazy about getting metalocation....don't do it
on start.
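For the ServerManager entry above, "hang around until count of regionservers stabilizes" could look roughly like the sketch below. It is not the waitForRegionServers implementation from the patch; the class name, the polling interval and the "unchanged for N consecutive checks" rule are assumptions chosen for illustration.

```java
import java.util.function.IntSupplier;

/** Hypothetical sketch, not HBase code: polls a count until it stops changing. */
public class StableCountWaiter {
  /**
   * Polls count every intervalMs and returns once the value has been the same
   * for stableChecks consecutive polls. Returns the last seen value if interrupted.
   */
  public static int waitUntilStable(IntSupplier count, long intervalMs, int stableChecks) {
    int last = count.getAsInt();
    int unchanged = 0;
    while (unchanged < stableChecks) {
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return last;
      }
      int now = count.getAsInt();
      if (now == last) {
        unchanged++; // one more poll with no new check-ins
      } else {
        last = now;  // something changed, so start counting again
        unchanged = 0;
      }
    }
    return last;
  }
}
```

This sketch has no upper bound on the total wait, so a count that keeps changing would keep it looping; a production version would presumably want a maximum wait as well.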
This addresses bug hbase-3010.
http://issues.apache.org/jira/browse/hbase-3010
Diffs
-----
src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 2bcd5d0
src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java 87fe9cd
src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 690f78c
src/main/java/org/apache/hadoop/hbase/master/HMaster.java c1b80eb
src/main/java/org/apache/hadoop/hbase/master/HMasterCommandLine.java c675db9
src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java 498650f
src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 528bb9d
src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java 1ec7f4e
src/test/java/org/apache/hadoop/hbase/master/TestActiveMasterManager.java 030bc12
src/test/java/org/apache/hadoop/hbase/master/TestMinimumServerCount.java d6f2c02
Diff: http://review.cloudera.org/r/873/diff
Testing
-------
Can now start/stop cluster repeatedly.
Thanks,
stack