Posted to issues@hbase.apache.org by "Sergey Shelukhin (JIRA)" <ji...@apache.org> on 2019/02/15 21:33:00 UTC

[jira] [Comment Edited] (HBASE-14498) Master stuck in infinite loop when all Zookeeper servers are unreachable (and RS may run after losing its znode)

    [ https://issues.apache.org/jira/browse/HBASE-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769759#comment-16769759 ] 

Sergey Shelukhin edited comment on HBASE-14498 at 2/15/19 9:32 PM:
-------------------------------------------------------------------

This is actually a critical data loss issue: if the RS is network-partitioned from ZK, the master could see its znode expire and reassign its regions while the RS's ZK watcher loops forever.
We've just had a ZK outage (not a network partition) and some RSes kept running for hours without a connection, still serving and operating on regions.

+1 on the patch for now; I will commit Monday if there are no objections. Nit: it should use HConstants.DEFAULT_ZK_SESSION_TIMEOUT, not a hard-coded 90000.
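For clarity, a minimal sketch of what the nit means (the wrapper class and conf variable here are made up, not the patch's code):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HConstants;

public class SessionTimeoutSketch {
  // Read the ZK session timeout from configuration, falling back to the
  // HConstants default (90 * 1000 ms) instead of a hard-coded 90000.
  static int zkSessionTimeout(Configuration conf) {
    return conf.getInt(HConstants.ZK_SESSION_TIMEOUT,
        HConstants.DEFAULT_ZK_SESSION_TIMEOUT);
  }
}
{code}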

I think this area needs to be hardened. If the server disconnects just before a ZK heartbeat that was somehow delayed for almost the whole session timeout, and then waits another 2/3 of the timeout before aborting (roughly 60 s more with the default 90 s session timeout), the master ends up waiting the connection timeout plus some extra time after the znode is gone before it can reassign the regions.
Also, I am not sure abort is good enough, because it also takes time and may do cleanup that touches region directories and WALs. Ideally the server should kill -9 itself when it loses the lock, or something close enough.
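A minimal sketch of the hard-exit idea, assuming a hypothetical callback that fires when the RS detects it has lost its znode/lock; Runtime.halt() skips shutdown hooks, so no cleanup can touch region directories or WALs:

{code:java}
public class HardExitSketch {
  // Hypothetical hook invoked when the RS notices its znode/lock is gone.
  static void onEphemeralNodeLost() {
    System.err.println("Lost ZK znode/lock; halting JVM without cleanup");
    // halt() bypasses shutdown hooks and finalizers - the closest in-JVM
    // analogue to kill -9.
    Runtime.getRuntime().halt(1);
  }
}
{code}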
Also, I wonder if we should use Curator; e.g. use a lock per server, with the master trying to steal all server locks - a server is dead as soon as its lock is stolen. Although I guess SUSPENDED would still need to be handled there in a similar way; but at least I hope Curator reports LOST when the connection is gone (rough sketch below). It would also reduce HBase's ZK code and let us benefit from Curator's accumulated wisdom :)
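Roughly what I have in mind with Curator - a sketch under the assumption that its connection-state listener fires LOST, not a patch (quorum string and retry policy are placeholders):

{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.retry.RetryNTimes;

public class CuratorLostSketch {
  public static void main(String[] args) {
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "zk1:2181,zk2:2181,zk3:2181", new RetryNTimes(3, 1000));
    client.getConnectionStateListenable().addListener((c, newState) -> {
      // SUSPENDED would still need the timer-based handling discussed above;
      // LOST means the session is gone, so die immediately.
      if (newState == ConnectionState.LOST) {
        Runtime.getRuntime().halt(1);
      }
    });
    client.start();
  }
}
{code}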
I'll file a JIRA or two after this patch.



> Master stuck in infinite loop when all Zookeeper servers are unreachable (and RS may run after losing its znode)
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-14498
>                 URL: https://issues.apache.org/jira/browse/HBASE-14498
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 3.0.0, 1.5.0, 2.0.0, 2.2.0
>            Reporter: Y. SREENIVASULU REDDY
>            Assignee: Pankaj Kumar
>            Priority: Blocker
>             Fix For: 3.0.0
>
>         Attachments: HBASE-14498-V2.patch, HBASE-14498-V3.patch, HBASE-14498-V4.patch, HBASE-14498-V5.patch, HBASE-14498-V6.patch, HBASE-14498-V6.patch, HBASE-14498-addendum.patch, HBASE-14498-branch-1.2.patch, HBASE-14498-branch-1.3-V2.patch, HBASE-14498-branch-1.3.patch, HBASE-14498-branch-1.4.patch, HBASE-14498-branch-1.patch, HBASE-14498.007.patch, HBASE-14498.008.patch, HBASE-14498.master.001.patch, HBASE-14498.master.002.patch, HBASE-14498.patch
>
>
> We met a weird scenario in our production environment.
> In an HA cluster,
> > The active master (HM1) is not able to connect to any ZooKeeper server (due to a network breakdown between the master machine and the ZooKeeper servers).
> {code}
> 2015-09-26 15:24:47,508 INFO [HM1-Host:16000.activeMasterManager-SendThread(ZK-Host:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 33463ms for sessionid 0x104576b8dda0002, closing socket connection and attempting reconnect
> 2015-09-26 15:24:47,877 INFO [HM1-Host:16000.activeMasterManager-SendThread(ZK-Host1:2181)] client.FourLetterWordMain: connecting to ZK-Host1 2181
> 2015-09-26 15:24:48,236 INFO [main-SendThread(ZK-Host1:2181)] client.FourLetterWordMain: connecting to ZK-Host1 2181
> 2015-09-26 15:24:49,879 WARN [HM1-Host:16000.activeMasterManager-SendThread(ZK-Host1:2181)] zookeeper.ClientCnxn: Can not get the principle name from server ZK-Host1
> 2015-09-26 15:24:49,879 INFO [HM1-Host:16000.activeMasterManager-SendThread(ZK-Host1:2181)] zookeeper.ClientCnxn: Opening socket connection to server ZK-Host1/ZK-IP1:2181. Will not attempt to authenticate using SASL (unknown error)
> 2015-09-26 15:24:50,238 WARN [main-SendThread(ZK-Host1:2181)] zookeeper.ClientCnxn: Can not get the principle name from server ZK-Host1
> 2015-09-26 15:24:50,238 INFO [main-SendThread(ZK-Host1:2181)] zookeeper.ClientCnxn: Opening socket connection to server ZK-Host1/ZK-Host1:2181. Will not attempt to authenticate using SASL (unknown error)
> 2015-09-26 15:25:17,470 INFO [main-SendThread(ZK-Host1:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 30023ms for sessionid 0x2045762cc710006, closing socket connection and attempting reconnect
> 2015-09-26 15:25:17,571 WARN [master/HM1-Host/HM1-IP:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ZK-Host:2181,ZK-Host1:2181,ZK-Host2:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
> 2015-09-26 15:25:17,872 INFO [main-SendThread(ZK-Host:2181)] client.FourLetterWordMain: connecting to ZK-Host 2181
> 2015-09-26 15:25:19,874 WARN [main-SendThread(ZK-Host:2181)] zookeeper.ClientCnxn: Can not get the principle name from server ZK-Host
> 2015-09-26 15:25:19,874 INFO [main-SendThread(ZK-Host:2181)] zookeeper.ClientCnxn: Opening socket connection to server ZK-Host/ZK-IP:2181. Will not attempt to authenticate using SASL (unknown error)
> {code}
> > Since HM1 was not able to connect to any ZK server, it never received the session expiry notification from the ZooKeeper side, and so HM1 didn't abort.
> > When HM1's ZooKeeper session timed out, the standby master (HM2) registered itself as the active master.
> > HM2 keeps waiting for region servers to report to it as part of active master initialization.
> {noformat} 
> 2015-09-26 15:24:44,928 | INFO | HM2-Host:21300.activeMasterManager | Waiting for region servers count to settle; currently checked in 0, slept for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms. | org.apache.hadoop.hbase.master.ServerManager.waitForRegionServers(ServerManager.java:1011)
> ---
> ---
> 2015-09-26 15:32:50,841 | INFO | HM2-Host:21300.activeMasterManager | Waiting for region servers count to settle; currently checked in 0, slept for 483913 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms. | org.apache.hadoop.hbase.master.ServerManager.waitForRegionServers(ServerManager.java:1011)
> {noformat}
> > At the other end, the region servers keep reporting to HM1 at a 3-second interval. A region server retrieves the master location from ZooKeeper only when it fails to connect to the master (ServiceException).
> Per the current design, a region server will not report to HM2 unless HM1 aborts, so HM2 will exit (InitializationMonitor) and again wait for region servers in a loop.
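For illustration only, a minimal hypothetical sketch of the reporting pattern described above; none of these names are HBase's actual classes:

{code:java}
public class ReportLoopSketch {
  interface MasterStub { void regionServerReport() throws Exception; }
  interface MasterLocator { MasterStub readActiveMasterFromZk(); }

  static void reportLoop(MasterLocator locator) throws InterruptedException {
    MasterStub master = locator.readActiveMasterFromZk();
    while (true) {
      try {
        master.regionServerReport();   // report to the cached master (HM1)
      } catch (Exception serviceException) {
        // Only on failure does the RS re-read the active master znode,
        // so as long as HM1 keeps answering, HM2 never hears from the RS.
        master = locator.readActiveMasterFromZk();
      }
      Thread.sleep(3000);              // ~3-second report interval
    }
  }
}
{code}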



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)