Posted to issues@hbase.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2010/11/23 09:35:13 UTC

[jira] Created: (HBASE-3266) Master does not seem to properly scan ZK for running RS during startup

Master does not seem to properly scan ZK for running RS during startup
----------------------------------------------------------------------

                 Key: HBASE-3266
                 URL: https://issues.apache.org/jira/browse/HBASE-3266
             Project: HBase
          Issue Type: Bug
          Components: master
    Affects Versions: 0.90.0
            Reporter: Todd Lipcon
            Priority: Critical


I was in the situation described by HBASE-3265, where I had a number of RS waiting on ROOT, but the master hadn't seen any RS checkins, so was waiting on checkins. To get past this, I restarted one of the region servers. The restarted server checked in, and the master began its startup.
At this point the master started scanning /hbase/.logs for things to split. It correctly identified that the RS on haus01 was running (this is the one I restarted):

2010-11-23 00:21:25,595 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://haus01.sf.cloudera.com:11020/hbase-normal/.logs/haus01.sf.cloudera.com,60020,1290500443143 belongs to an existing region server

but then incorrectly decided that the RS on haus02 was down:

2010-11-23 00:21:25,595 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://haus01.sf.cloudera.com:11020/hbase-normal/.logs/haus02.sf.cloudera.com,60020,1290498411450 doesn't belong to a known region server, splitting

However ZK shows that this RS is up:
[zk: haus01.sf.cloudera.com:2222(CONNECTED) 3] ls /hbase/rs
[haus04.sf.cloudera.com,60020,1290498411533, haus05.sf.cloudera.com,60020,1290498411520, haus03.sf.cloudera.com,60020,1290498411518, haus01.sf.cloudera.com,60020,1290500443143, haus02.sf.cloudera.com,60020,1290498411450]

splitLogsAfterStartup seems to check ServerManager.onlineServers, which as best I can tell is derived from heartbeats and not from ZK (sorry if I got some of this wrong, I'm still new to this new codebase).

Of course, the master went into an infinite splitting loop at this point since haus02 is up and renewing its DFS lease on its logs.
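
The mismatch described above can be sketched roughly as follows. This is only an illustration, not the HBase code: the class LogFolderCheck and the method foldersToSplit are hypothetical names, and the sketch assumes that each folder under .logs is named after its server (host,port,startcode), as the log lines above suggest. It compares a check based only on servers that have checked in with one that also consults the children of /hbase/rs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these names are taken from the HBase source.
class LogFolderCheck {

    // Return the log folders whose server name is not in the given live set.
    static List<String> foldersToSplit(List<String> logFolders, Set<String> liveServers) {
        List<String> toSplit = new ArrayList<>();
        for (String folder : logFolders) {
            if (!liveServers.contains(folder)) {
                toSplit.add(folder);
            }
        }
        return toSplit;
    }

    public static void main(String[] args) {
        // Folder names as they appear in the log lines of this report.
        List<String> logFolders = Arrays.asList(
            "haus01.sf.cloudera.com,60020,1290500443143",
            "haus02.sf.cloudera.com,60020,1290498411450");

        // Only the restarted server has checked in with the master.
        Set<String> checkedIn = new HashSet<>(Arrays.asList(
            "haus01.sf.cloudera.com,60020,1290500443143"));

        // Children of /hbase/rs as listed by the ZK shell in this report.
        Set<String> inZk = new HashSet<>(Arrays.asList(
            "haus04.sf.cloudera.com,60020,1290498411533",
            "haus05.sf.cloudera.com,60020,1290498411520",
            "haus03.sf.cloudera.com,60020,1290498411518",
            "haus01.sf.cloudera.com,60020,1290500443143",
            "haus02.sf.cloudera.com,60020,1290498411450"));

        // Assumption for illustration: treat a server as live if either source knows it.
        Set<String> checkedInOrInZk = new HashSet<>(checkedIn);
        checkedInOrInZk.addAll(inZk);

        System.out.println("check-ins only:  " + foldersToSplit(logFolders, checkedIn));
        System.out.println("check-ins or ZK: " + foldersToSplit(logFolders, checkedInOrInZk));
    }
}
```

With check-ins alone, the haus02 folder is selected for splitting; once the ZK listing is taken into account, no folder is selected.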

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] [Resolved] (HBASE-3266) Master does not seem to properly scan ZK for running RS during startup

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu resolved HBASE-3266.
---------------------------

    Resolution: Not A Problem

From Todd:
3266 is probably no longer valid given heartbeats don't exist in trunk.

[jira] Commented: (HBASE-3266) Master does not seem to properly scan ZK for running RS during startup

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934924#action_12934924 ] 

Jonathan Gray commented on HBASE-3266:
--------------------------------------

Yeah, I think that, as it currently stands, the HMaster uses the startup/heartbeat messages to determine which RS are online.  As I commented in the other jira, we should see why the RS were not checking in.

We should do some reconciliation between what we find in ZK and what we think is online based on RPCs, but I'm not sure exactly what course we would take in a state like this.
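
As a rough sketch of what detecting such a discrepancy might look like (hypothetical names throughout; ServerReconcile and its methods are not HBase APIs, and what to do with the result is left open, as above), one could compute the two set differences between the ZK view and the RPC view:

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; it only reports discrepancies and takes no action.
class ServerReconcile {

    // Servers registered in ZK that the master has not heard from over RPC.
    static Set<String> inZkButNotCheckedIn(Set<String> zkServers, Set<String> rpcServers) {
        Set<String> diff = new TreeSet<>(zkServers);
        diff.removeAll(rpcServers);
        return diff;
    }

    // Servers the master heard from over RPC that have no node in ZK.
    static Set<String> checkedInButNotInZk(Set<String> zkServers, Set<String> rpcServers) {
        Set<String> diff = new TreeSet<>(rpcServers);
        diff.removeAll(zkServers);
        return diff;
    }

    public static void main(String[] args) {
        // Two of the server names from this report, used as sample data.
        Set<String> zk = new TreeSet<>(Arrays.asList(
            "haus01.sf.cloudera.com,60020,1290500443143",
            "haus02.sf.cloudera.com,60020,1290498411450"));
        Set<String> rpc = new TreeSet<>(Arrays.asList(
            "haus01.sf.cloudera.com,60020,1290500443143"));

        System.out.println("in ZK, no check-in: " + inZkButNotCheckedIn(zk, rpc));
        System.out.println("check-in, not in ZK: " + checkedInButNotInZk(zk, rpc));
    }
}
```

Keeping the two directions separate seems useful because they likely call for different handling: a server present in ZK but silent over RPC may be stuck, while the reverse suggests a stale check-in.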

[jira] Updated: (HBASE-3266) Master does not seem to properly scan ZK for running RS during startup

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-3266:
-------------------------

    Fix Version/s:     (was: 0.90.0)
                   0.92.0

Chatting with Jon, there is a problem here if ZK does not agree with the set of online servers.  This would happen if an HRS is stuck.  We could add a reconcile step to the master on startup such that, if there is a discrepancy, the master could expire the HRS, in effect killing it.  This would be good (especially if it could be done in a non-racy way).

But the thought is that this condition should be extremely rare, especially since HBASE-3265 went in, AND given that we'd like to do away with heartbeating altogether, let's just punt this out to 0.92 rather than hack up some messy reconcile.

Moving out.  Please move back in if you disagree.

[jira] Updated: (HBASE-3266) Master does not seem to properly scan ZK for running RS during startup

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-3266:
-------------------------

    Fix Version/s: 0.90.0

Bringing into 0.90.0 while we triage.
