Posted to common-issues@hadoop.apache.org by "Karthik Kambatla (JIRA)" <ji...@apache.org> on 2014/06/21 01:58:24 UTC
[jira] [Commented] (HADOOP-10584) ActiveStandbyElector goes down if ZK quorum become unavailable
[ https://issues.apache.org/jira/browse/HADOOP-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039564#comment-14039564 ]
Karthik Kambatla commented on HADOOP-10584:
-------------------------------------------
Logs from when we saw this error:
{noformat}
zzzz-yy-xx 06:01:30,039 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 3335ms for sessionid 0x2459abcbfd0027f, closing socket connection and attempting reconnect
zzzz-yy-xx 06:01:30,144 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
zzzz-yy-xx 06:01:30,233 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server MASKED-1/10.1.128.51:2181. Will not attempt to authenticate using SASL (unknown error)
zzzz-yy-xx 06:01:30,233 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to MASKED-1/10.1.128.51:2181, initiating session
zzzz-yy-xx 06:01:31,901 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 1667ms for sessionid 0x2459abcbfd0027f, closing socket connection and attempting reconnect
zzzz-yy-xx 06:01:32,405 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server MASKED-2/10.1.128.48:2181. Will not attempt to authenticate using SASL (unknown error)
zzzz-yy-xx 06:01:32,406 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to MASKED-2/10.1.128.48:2181, initiating session
zzzz-yy-xx 06:01:32,409 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server MASKED-2/10.1.128.48:2181, sessionid = 0x2459abcbfd0027f, negotiated timeout = 5000
zzzz-yy-xx 06:01:32,412 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
zzzz-yy-xx 06:01:35,742 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 3334ms for sessionid 0x2459abcbfd0027f, closing socket connection and attempting reconnect
zzzz-yy-xx 06:01:35,850 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
zzzz-yy-xx 06:01:35,966 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server MASKED-3/10.1.128.49:2181. Will not attempt to authenticate using SASL (unknown error)
zzzz-yy-xx 06:01:35,967 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to MASKED-3/10.1.128.49:2181, initiating session
zzzz-yy-xx 06:01:35,968 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server MASKED-3/10.1.128.49:2181, sessionid = 0x2459abcbfd0027f, negotiated timeout = 5000
zzzz-yy-xx 06:01:35,972 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
zzzz-yy-xx 06:01:39,303 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 3335ms for sessionid 0x2459abcbfd0027f, closing socket connection and attempting reconnect
zzzz-yy-xx 06:01:39,411 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
zzzz-yy-xx 06:01:39,904 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server MASKED-1/10.1.128.51:2181. Will not attempt to authenticate using SASL (unknown error)
zzzz-yy-xx 06:01:39,904 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to MASKED-1/10.1.128.51:2181, initiating session
zzzz-yy-xx 06:01:41,572 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 1668ms for sessionid 0x2459abcbfd0027f, closing socket connection and attempting reconnect
zzzz-yy-xx 06:01:41,678 FATAL org.apache.hadoop.ha.ActiveStandbyElector: Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
zzzz-yy-xx 06:01:41,926 INFO org.apache.zookeeper.ZooKeeper: Session: 0x2459abcbfd0027f closed
zzzz-yy-xx 06:01:41,927 FATAL org.apache.hadoop.ha.ZKFailoverController: Fatal error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
zzzz-yy-xx 06:01:41,927 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x2459abcbfd0027f
zzzz-yy-xx 06:01:41,927 INFO org.apache.hadoop.ipc.Server: Stopping server on 8018
zzzz-yy-xx 06:01:41,927 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
zzzz-yy-xx 06:01:41,928 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
zzzz-yy-xx 06:01:41,928 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8018
zzzz-yy-xx 06:01:41,928 INFO org.apache.hadoop.ha.HealthMonitor: Stopping HealthMonitor thread
zzzz-yy-xx 06:01:41,928 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
{noformat}
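The log shows two recurring "have not heard from server" gaps: about 1667 ms and about 3335 ms. Both appear consistent with how the ZooKeeper 3.4-era Java client seems to derive its timeouts from the negotiated session timeout:

- `sessionTimeout * 2 / 3` applies once a session is established.
- `sessionTimeout / <number of servers>` applies while a reconnection handshake is still in progress.

The sketch below reproduces that arithmetic for the `negotiated timeout = 5000` seen above. It assumes that MASKED-1, MASKED-2 and MASKED-3 make up the whole ensemble. The class and method names are hypothetical and are not part of ZooKeeper's API.

```java
/**
 * Hypothetical helper (not a ZooKeeper API) that reproduces the timeout
 * arithmetic the ZooKeeper 3.4-era Java client appears to apply after a
 * session timeout has been negotiated.
 */
public class ZkClientTimeouts {

    /** Timeout while a session is established: two thirds of the session timeout. */
    static int readTimeoutMs(int negotiatedSessionTimeoutMs) {
        // Integer arithmetic: 5000 * 2 / 3 == 3333
        return negotiatedSessionTimeoutMs * 2 / 3;
    }

    /** Timeout while a (re)connection handshake is in progress: session timeout split across servers. */
    static int connectTimeoutMs(int negotiatedSessionTimeoutMs, int ensembleSize) {
        // Integer arithmetic: 5000 / 3 == 1666
        return negotiatedSessionTimeoutMs / ensembleSize;
    }

    public static void main(String[] args) {
        int negotiated = 5000; // "negotiated timeout = 5000" in the log
        int servers = 3;       // assumption: MASKED-1..3 are the whole ensemble
        System.out.println("readTimeout    = " + readTimeoutMs(negotiated) + " ms");
        System.out.println("connectTimeout = " + connectTimeoutMs(negotiated, servers) + " ms");
    }
}
```

The program prints 3333 ms and 1666 ms. Read against the log, the 3334 to 3335 ms gaps fall just past the 3333 ms read timeout, on connections whose session had been established. The two 1667 to 1668 ms gaps fall just past the 1666 ms connect timeout, on MASKED-1, where the handshake never completed. Every reconnect reused session 0x2459abcbfd0027f, so the session itself had not expired. The elector gave up on CONNECTIONLOSS instead.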
> ActiveStandbyElector goes down if ZK quorum become unavailable
> --------------------------------------------------------------
>
> Key: HADOOP-10584
> URL: https://issues.apache.org/jira/browse/HADOOP-10584
> Project: Hadoop Common
> Issue Type: Bug
> Components: ha
> Affects Versions: 2.4.0
> Reporter: Karthik Kambatla
> Assignee: Karthik Kambatla
> Priority: Critical
> Attachments: hadoop-10584-prelim.patch
>
>
> ActiveStandbyElector retries operations a few times. If the ZK quorum itself is down, the elector gives up and shuts down, and the daemons have to be brought up again.
> Instead, it should log the fact that it is unable to talk to ZK, call becomeStandby on its client, and continue to attempt connecting to ZK.
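The proposal above can be illustrated with a small, self-contained model. It contrasts the behavior seen in the log (a few retries, then a fatal error) with the proposed one (log, step down to standby, keep reconnecting).

Everything in this model is hypothetical and is not Hadoop's `ActiveStandbyElector` API. That includes `ElectorClient`, both handler methods, and the retry budget of 3. The capped exponential backoff is likewise an assumption; the issue does not specify a reconnect schedule.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical model, not Hadoop code: two ways an elector could react to
 * repeated ZooKeeper CONNECTIONLOSS errors while monitoring its lock znode.
 */
public class ElectorRetrySketch {

    /** Hypothetical stand-in for the elector's client (the service made active or standby). */
    interface ElectorClient {
        void becomeStandby();
        void notifyFatalError(String message);
    }

    /** Records callbacks so each policy's effect can be inspected. */
    static class RecordingClient implements ElectorClient {
        final List<String> events = new ArrayList<>();
        public void becomeStandby() { events.add("standby"); }
        public void notifyFatalError(String message) { events.add("fatal: " + message); }
    }

    static final int MAX_RETRIES = 3;          // assumption: small fixed retry budget
    static final long BASE_DELAY_MS = 1_000L;  // assumption: first reconnect delay
    static final long MAX_DELAY_MS = 30_000L;  // assumption: backoff cap

    /**
     * Behavior reported in the issue: retry a few times, then give up for good.
     * Returns true if monitoring should be retried, false once the error is fatal.
     */
    static boolean onConnectionLossBounded(int failuresSoFar, ElectorClient client) {
        if (failuresSoFar < MAX_RETRIES) {
            return true;
        }
        client.notifyFatalError("Not retrying further znode monitoring connection errors.");
        return false; // the daemon exits and has to be brought up again
    }

    /**
     * Behavior proposed in the issue: log, become standby, keep trying to reconnect.
     * Returns the delay in ms before the next attempt (capped exponential backoff).
     */
    static long onConnectionLossResilient(int failuresSoFar, ElectorClient client) {
        if (failuresSoFar == 0) {
            // First failure of this outage: the lock can no longer be confirmed, so stop acting as active.
            System.err.println("Unable to talk to ZooKeeper; becoming standby and retrying");
            client.becomeStandby();
        }
        long delay = BASE_DELAY_MS << Math.min(failuresSoFar, 5);
        return Math.min(delay, MAX_DELAY_MS);
    }

    public static void main(String[] args) {
        RecordingClient bounded = new RecordingClient();
        int failures = 0;
        while (onConnectionLossBounded(failures, bounded)) {
            failures++;
        }
        System.out.println("bounded: gave up after " + failures + " retries, events=" + bounded.events);

        RecordingClient resilient = new RecordingClient();
        for (int f = 0; f < 6; f++) {
            System.out.println("resilient: failure " + f + ", next attempt in "
                    + onConnectionLossResilient(f, resilient) + " ms");
        }
        System.out.println("resilient: events=" + resilient.events);
    }
}
```

In this sketch `becomeStandby` is called only on the first failure of an outage, so an already-standby service is not transitioned repeatedly. The issue does not say whether the real elector should do this once or on every failure.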