You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by "Ted Yu (JIRA)" <ji...@apache.org> on 2011/07/26 22:36:09 UTC

[jira] [Resolved] (HBASE-3266) Master does not seem to properly scan ZK for running RS during startup

     [ https://issues.apache.org/jira/browse/HBASE-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu resolved HBASE-3266.
---------------------------

    Resolution: Not A Problem

>From Todd:
3266 is probably no longer valid given heartbeats don't exist in trunk.

> Master does not seem to properly scan ZK for running RS during startup
> ----------------------------------------------------------------------
>
>                 Key: HBASE-3266
>                 URL: https://issues.apache.org/jira/browse/HBASE-3266
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.0
>            Reporter: Todd Lipcon
>            Priority: Critical
>             Fix For: 0.92.0
>
>
> I was in the situation described by HBASE-3265, where I had a number of RS waiting on ROOT, but the master hadn't seen any RS checkins, so was waiting on checkins. To get past this, I restarted one of the region servers. The restarted server checked in, and the master began its startup.
> At this point the master started scanning /hbase/.logs for things to split. It correctly identified that the RS on haus01 was running (this is the one I restarted):
> 2010-11-23 00:21:25,595 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://haus01.sf.cloudera.com:11020/hbase-normal/.logs/haus01.sf.cloudera.com,60020,1290500443143 belongs to an existing region server
> but then incorrectly decided that the RS on haus02 was down:
> 2010-11-23 00:21:25,595 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://haus01.sf.cloudera.com:11020/hbase-normal/.logs/haus02.sf.cloudera.com,60020,1290498411450 doesn't belong to a known region server, splitting
> However ZK shows that this RS is up:
> [zk: haus01.sf.cloudera.com:2222(CONNECTED) 3] ls /hbase/rs
> [haus04.sf.cloudera.com,60020,1290498411533, haus05.sf.cloudera.com,60020,1290498411520, haus03.sf.cloudera.com,60020,1290498411518, haus01.sf.cloudera.com,60020,1290500443143, haus02.sf.cloudera.com,60020,1290498411450]
> splitLogsAfterStartup seems to check ServerManager.onlineServers, which best I can tell is derived from heartbeats and not from ZK (sorry if I got some of this wrong, still new to this new codebase)
> Of course, the master went into an infinite splitting loop at this point since haus02 is up and renewing its DFS lease on its logs.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira