Posted to common-issues@hadoop.apache.org by "Karthik Kambatla (JIRA)" <ji...@apache.org> on 2014/06/21 01:58:24 UTC

[jira] [Commented] (HADOOP-10584) ActiveStandbyElector goes down if ZK quorum become unavailable

    [ https://issues.apache.org/jira/browse/HADOOP-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039564#comment-14039564 ] 

Karthik Kambatla commented on HADOOP-10584:
-------------------------------------------

Logs from when we saw this error:

{noformat}
zzzz-yy-xx 06:01:30,039 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 3335ms for sessionid 0x2459abcbfd0027f, closing socket connection and attempting reconnect
zzzz-yy-xx 06:01:30,144 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
zzzz-yy-xx 06:01:30,233 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server MASKED-1/10.1.128.51:2181. Will not attempt to authenticate using SASL (unknown error)
zzzz-yy-xx 06:01:30,233 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to MASKED-1/10.1.128.51:2181, initiating session
zzzz-yy-xx 06:01:31,901 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 1667ms for sessionid 0x2459abcbfd0027f, closing socket connection and attempting reconnect
zzzz-yy-xx 06:01:32,405 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server MASKED-2/10.1.128.48:2181. Will not attempt to authenticate using SASL (unknown error)
zzzz-yy-xx 06:01:32,406 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to MASKED-2/10.1.128.48:2181, initiating session
zzzz-yy-xx 06:01:32,409 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server MASKED-2/10.1.128.48:2181, sessionid = 0x2459abcbfd0027f, negotiated timeout = 5000
zzzz-yy-xx 06:01:32,412 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
zzzz-yy-xx 06:01:35,742 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 3334ms for sessionid 0x2459abcbfd0027f, closing socket connection and attempting reconnect
zzzz-yy-xx 06:01:35,850 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
zzzz-yy-xx 06:01:35,966 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server MASKED-3/10.1.128.49:2181. Will not attempt to authenticate using SASL (unknown error)
zzzz-yy-xx 06:01:35,967 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to MASKED-3/10.1.128.49:2181, initiating session
zzzz-yy-xx 06:01:35,968 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server MASKED-3/10.1.128.49:2181, sessionid = 0x2459abcbfd0027f, negotiated timeout = 5000
zzzz-yy-xx 06:01:35,972 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
zzzz-yy-xx 06:01:39,303 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 3335ms for sessionid 0x2459abcbfd0027f, closing socket connection and attempting reconnect
zzzz-yy-xx 06:01:39,411 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
zzzz-yy-xx 06:01:39,904 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server MASKED-1/10.1.128.51:2181. Will not attempt to authenticate using SASL (unknown error)
zzzz-yy-xx 06:01:39,904 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to MASKED-1/10.1.128.51:2181, initiating session
zzzz-yy-xx 06:01:41,572 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 1668ms for sessionid 0x2459abcbfd0027f, closing socket connection and attempting reconnect
zzzz-yy-xx 06:01:41,678 FATAL org.apache.hadoop.ha.ActiveStandbyElector: Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
zzzz-yy-xx 06:01:41,926 INFO org.apache.zookeeper.ZooKeeper: Session: 0x2459abcbfd0027f closed
zzzz-yy-xx 06:01:41,927 FATAL org.apache.hadoop.ha.ZKFailoverController: Fatal error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
zzzz-yy-xx 06:01:41,927 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x2459abcbfd0027f
zzzz-yy-xx 06:01:41,927 INFO org.apache.hadoop.ipc.Server: Stopping server on 8018
zzzz-yy-xx 06:01:41,927 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
zzzz-yy-xx 06:01:41,928 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
zzzz-yy-xx 06:01:41,928 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8018
zzzz-yy-xx 06:01:41,928 INFO org.apache.hadoop.ha.HealthMonitor: Stopping HealthMonitor thread
zzzz-yy-xx 06:01:41,928 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
{noformat}
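
A note on the timings in these logs, offered as a reading aid rather than as part of the original report. The "have not heard from server" values cluster around 3335ms and 1667ms. This is consistent with my understanding that the ZooKeeper client uses a read timeout of two thirds of the negotiated session timeout while connected, and a connect timeout of the session timeout divided by the number of servers while reconnecting. Those two formulas are an assumption about the client and are not stated in the logs; the class and method names below are hypothetical. A minimal sketch of the arithmetic:

```java
public class ZkTimeoutArithmetic {
    /** Assumed read timeout while connected: two thirds of the session timeout. */
    static int readTimeoutMs(int sessionTimeoutMs) {
        return sessionTimeoutMs * 2 / 3;
    }

    /** Assumed connect timeout while reconnecting: session timeout split across servers. */
    static int connectTimeoutMs(int sessionTimeoutMs, int serverCount) {
        return sessionTimeoutMs / serverCount;
    }

    public static void main(String[] args) {
        int negotiatedMs = 5000; // "negotiated timeout = 5000" in the log
        int servers = 3;         // MASKED-1, MASKED-2, MASKED-3 in the log
        System.out.println("read timeout: " + readTimeoutMs(negotiatedMs) + " ms");
        System.out.println("connect timeout: " + connectTimeoutMs(negotiatedMs, servers) + " ms");
    }
}
```

With integer division this gives 3333 ms and 1666 ms. The logged values are a millisecond or two higher, which is what one would expect if the client reports the idle time it actually measured once the limit was exceeded.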

> ActiveStandbyElector goes down if ZK quorum become unavailable
> --------------------------------------------------------------
>
>                 Key: HADOOP-10584
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10584
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.4.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>            Priority: Critical
>         Attachments: hadoop-10584-prelim.patch
>
>
> ActiveStandbyElector retries operations only a few times. If the ZK quorum itself is down, the elector gives up and shuts down, and the daemons have to be brought up again.
> Instead, it should log the fact that it is unable to talk to ZK, call becomeStandby on its client, and continue to attempt connecting to ZK.
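
The behavior proposed above (log the failure, call becomeStandby, keep trying to connect) can be sketched roughly as follows. This is an illustration only, not the attached patch and not the actual ActiveStandbyElector code: the ReconnectSketch class, the ElectorCallback interface, the reconnectUntilAvailable method, and the fixed sleep between attempts are all hypothetical, and the ZK connection is simulated with a BooleanSupplier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

public class ReconnectSketch {
    /** Hypothetical stand-in for the elector's client callback. */
    interface ElectorCallback {
        void becomeStandby();
    }

    /**
     * Keeps trying to connect until tryConnect reports success. Each failure
     * is logged instead of being treated as fatal, and the client is moved to
     * standby on the first failure. Returns the number of attempts made.
     */
    static int reconnectUntilAvailable(BooleanSupplier tryConnect,
                                       ElectorCallback callback,
                                       List<String> log,
                                       long sleepMillis) {
        int attempts = 0;
        boolean demoted = false;
        while (true) {
            attempts++;
            if (tryConnect.getAsBoolean()) {
                log.add("Connected to ZK after " + attempts + " attempt(s)");
                return attempts;
            }
            log.add("Unable to talk to ZK (attempt " + attempts + "), will retry");
            if (!demoted) {
                callback.becomeStandby(); // step down once, then keep retrying
                demoted = true;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting to retry", e);
            }
        }
    }

    public static void main(String[] args) {
        // Simulated quorum that is unavailable for the first three attempts.
        int[] remainingFailures = {3};
        int[] standbyCalls = {0};
        List<String> log = new ArrayList<>();
        int attempts = reconnectUntilAvailable(
            () -> remainingFailures[0]-- <= 0,
            () -> standbyCalls[0]++,
            log, 10L);
        log.forEach(System.out::println);
        System.out.println("attempts=" + attempts
            + ", becomeStandby calls=" + standbyCalls[0]);
    }
}
```

The sketch retries forever by design, since the point of the proposal is that the daemon should outlive a quorum outage. A real implementation would presumably also need some backoff and a way to stop the loop on shutdown.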



--
This message was sent by Atlassian JIRA
(v6.2#6252)