Posted to user@zookeeper.apache.org by elastic search <el...@gmail.com> on 2015/10/14 02:51:50 UTC

Observers Unusable

Hello Experts

We have 2 Observers running in AWS that connect to the local ZK ensemble in
our own DataCenter.

There have been instances where we see the network drop for a minute between
the two networks.
However, the Observers take around 15 minutes to recover even if the network
outage only lasts a minute.

From the logs:
java.net.SocketTimeoutException: Read timed out
2015-10-13 22:26:03,927 [myid:4] - INFO
 [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
Notification time out: 400
2015-10-13 22:26:04,328 [myid:4] - INFO
 [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
Notification time out: 800
2015-10-13 22:26:05,129 [myid:4] - INFO
 [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
Notification time out: 1600
2015-10-13 22:26:06,730 [myid:4] - INFO
 [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
Notification time out: 3200
2015-10-13 22:26:09,931 [myid:4] - INFO
 [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
Notification time out: 6400
2015-10-13 22:26:16,332 [myid:4] - INFO
 [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
Notification time out: 12800
2015-10-13 22:26:29,133 [myid:4] - INFO
 [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
Notification time out: 25600
2015-10-13 22:26:54,734 [myid:4] - INFO
 [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
Notification time out: 51200
2015-10-13 22:27:45,935 [myid:4] - INFO
 [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
Notification time out: 60000
2015-10-13 22:28:45,936 [myid:4] - INFO
 [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
Notification time out: 60000
2015-10-13 22:29:45,937 [myid:4] - INFO
 [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
Notification time out: 60000
2015-10-13 22:30:45,938 [myid:4] - INFO
 [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
Notification time out: 60000
2015-10-13 22:31:45,939 [myid:4] - INFO
 [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
Notification time out: 60000
2015-10-13 22:32:45,940 [myid:4] - INFO
 [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
Notification time out: 60000
2015-10-13 22:33:45,941 [myid:4] - INFO
 [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
Notification time out: 60000
2015-10-13 22:34:45,942 [myid:4] - INFO
 [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
Notification time out: 60000
2015-10-13 22:35:45,943 [myid:4] - INFO
 [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
Notification time out: 60000
2015-10-13 22:36:45,944 [myid:4] - INFO
 [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
Notification time out: 60000
2015-10-13 22:37:45,945 [myid:4] - INFO
 [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
Notification time out: 60000
2015-10-13 22:38:45,946 [myid:4] - INFO
 [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
Notification time out: 60000
2015-10-13 22:39:45,947 [myid:4] - INFO
 [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
Notification time out: 60000
2015-10-13 22:40:45,948 [myid:4] - INFO
 [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
Notification time out: 60000
2015-10-13 22:41:45,949 [myid:4] - INFO
 [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
Notification time out: 60000

And then it finally exits the QuorumCnxManager run loop with the following
message:
WARN  [RecvWorker:2:QuorumCnxManager$RecvWorker@780] - Connection broken
for id 2
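
For reference, the repeated lines above are the leader election notification
backoff: the wait doubles on each retry (400, 800, ... 51200 ms) and then stays
pinned at the 60000 ms cap, exactly as the log shows. A rough sketch of that
pattern (purely illustrative, not the actual ZooKeeper source):

    // Illustrative sketch of the backoff visible in the log above.
    public class NotificationBackoffSketch {
        public static void main(String[] args) {
            int timeoutMs = 400; // first "Notification time out" value in the log
            for (int i = 0; i < 12; i++) {
                System.out.println("Notification time out: " + timeoutMs);
                timeoutMs = Math.min(timeoutMs * 2, 60000); // double, capped at 60 s
            }
        }
    }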

How can we ensure the observer does not go out of service for such a long
duration?

Attached are the full logs.

Please help
Thanks

Re: Observers Unusable

Posted by elastic search <el...@gmail.com>.
Thanks Bob and Flavio for your replies.

I am considering opening a JIRA ticket for this issue, unless somebody
sees that this is not an Observer issue.

On Wed, Oct 14, 2015 at 3:20 PM, Bob Sheehan <bs...@vmware.com> wrote:

> To be more specific: after leader election, the old observer connections to
> the previous leader are stale (the send-q accumulates), but the TCP timeout
> after 15 minutes terminates these old connections and allows a fresh TCP
> connection to the new leader.
>
> Bob
>
> On 10/14/15, 3:17 PM, "Bob Sheehan" <bs...@vmware.com> wrote:
>
> >We face a similar issue with remote observers connecting to 3 node
> >cluster. Using netstat we see a send-q tcp traffic accumulating. We
> >believe this is a tcp timeout issue.
> >
> >Bob
> >
> >
> >On 10/14/15, 2:54 PM, "elastic search" <el...@gmail.com> wrote:
> >
> >>When it outputs this message "There is a connection already for server" ,
> >>is it really valid connection or using the existing map in memory to
> >>figure
> >>if the connection already exists.
> >>If that is the case it should empty this map and initate a connection
> >>when
> >>it is LOOKING for a leader after a connection drop.
> >>Right now it waits 15 minutes before it does that and re-iniates a
> >>connection
> >>
> >> if (senderWorkerMap.get(sid) == null){
> >>   } else {
> >>            LOG.debug("There is a connection already for server " + sid);
> >>
> >>If the ping to the server is not working how is it able to send
> >>notifications . The logs says Sending notifications but the ping is not
> >>working at that time.
> >>
> >>I am not sure how to decrease the 15 minute wait period before the
> >>Observer
> >>re-iniates a connection.
> >>Or Alternatively how to make this work
> >>
> >>
> >>On Wed, Oct 14, 2015 at 12:50 PM, Flavio Junqueira <fp...@apache.org>
> >>wrote:
> >>
> >>> The snapshot is really small, so it is unlikely to be it, unless the
> >>> connection is really crappy.
> >>>
> >>> -Flavio
> >>>
> >>> > On 14 Oct 2015, at 19:38, elastic search <el...@gmail.com>
> >>>wrote:
> >>> >
> >>> > Hi Flavio
> >>> >
> >>> > Thanks for the reply.
> >>> >
> >>> > These are the config parameters
> >>> > tickTime=2000
> >>> > initLimit=10
> >>> > syncLimit=5
> >>> >
> >>> > Data size is
> >>> > -rw-r--r-- 1 xoom xoom  85K Oct  3 10:16 snapshot.170001badb
> >>> > -rw-r--r-- 1 xoom xoom  65M Oct  3 10:16 log.170000da7f
> >>> > -rw-r--r-- 1 xoom xoom  88K Oct 13 23:30 snapshot.170002bfcd
> >>> > -rw-r--r-- 1 xoom xoom  65M Oct 13 23:30 log.170001badd
> >>> > -rw-r--r-- 1 xoom xoom  65M Oct 14 11:36 log.170002bfcf
> >>> >
> >>> > These are all defaults and i have never deleted any log files.
> >>> > Do you still think this could be due to syncLimit or the snapshot
> >>>being
> >>> > large
> >>> >
> >>> >
> >>> >
> >>> > On Wed, Oct 14, 2015 at 9:44 AM, Flavio Junqueira <fp...@apache.org>
> >>> wrote:
> >>> >
> >>> >>
> >>> >>> On 14 Oct 2015, at 17:22, elastic search <el...@gmail.com>
> >>> wrote:
> >>> >>>
> >>> >>> The link between the AWS and the local DataCenter was down for
> >>>around 2
> >>> >>> minutes.
> >>> >>> I have been running ping continuously from the DataCenter to the
> >>>AWS
> >>> and
> >>> >>> that wasn't responding for a few minutes.
> >>> >>>
> >>> >>> Are you saying since we see Send Notification messages in the logs
> >>>,
> >>> that
> >>> >>> would mean the Observers are able to connect to the ZK .
> >>> >>> Only that the ZK server leader is not able to respond back ?
> >>> >>
> >>> >> Yeah, the logs indicate that the observer is being able to connect
> >>> ("There
> >>> >> is a connection already for server..." in the logs) but isn't able
> >>>to
> >>> make
> >>> >> progress, which implies that it isn't receiving responses.
> >>> >>
> >>> >>>
> >>> >>> This is what i see from the Server logs
> >>> >>> 2015-10-09 16:02:45,780 [myid:3] - ERROR [LearnerHandler-/
> >>> >> 10.10.4.46:38161
> >>> >>> :LearnerHandler@633] - Unexpected exception causing shutdown while
> >>> sock
> >>> >>> still open
> >>> >>> 2015-10-09 16:19:28,160 [myid:3] - WARN
> >>> >>> [RecvWorker:5:QuorumCnxManager$RecvWorker@780] - Connection broken
> >>>for
> >>> >> id
> >>> >>> 5, my id = 3, error =
> >>> >>>
> >>> >>
> >>> >> These lines belong to different parts of the code path. The first
> >>> actually
> >>> >> indicates that the observer found the leader, but could not keep
> >>>going.
> >>> For
> >>> >> this particular time period, it could be that the observer is trying
> >>>to
> >>> >> sync up with the leader, but the value of syncLimit is too low and
> >>>the
> >>> >> leader is abandoning the observer. One reason for that to happen is
> >>>the
> >>> >> snapshot being large (or your available bandwidth being low).
> >>> >>
> >>> >> The second line is from the part of the code that does connections
> >>>for
> >>> >> leader election. That one also says that the connection broke.
> >>> >>
> >>> >> -Flavio
> >>> >>
> >>> >>
> >>> >>> On Wed, Oct 14, 2015 at 1:28 AM, Flavio Junqueira <fp...@apache.org>
> >>> >> wrote:
> >>> >>>
> >>> >>>> Can you tell why the server wasn't responding to the notifications
> >>> from
> >>> >>>> the observer? The log file is from the observer and it sounds like
> >>>it
> >>> is
> >>> >>>> being able to send messages out, but it isn't clear why the server
> >>> isn't
> >>> >>>> responding.
> >>> >>>>
> >>> >>>> -Flavio
> >>> >>>>
> >>> >>>>> On 14 Oct 2015, at 01:51, elastic search <el...@gmail.com>
> >>> >> wrote:
> >>> >>>>>
> >>> >>>>>
> >>> >>>>> Hello Experts
> >>> >>>>>
> >>> >>>>> We have 2 Observers running in AWS connecting over to local ZK
> >>> Ensemble
> >>> >>>> in our own DataCenter.
> >>> >>>>>
> >>> >>>>> There have been instances where we see network drop for a minute
> >>> >> between
> >>> >>>> the networks.
> >>> >>>>> However the Observers take around 15 minutes to recover even if
> >>>the
> >>> >>>> network outage is for a minute.
> >>> >>>>>
> >>> >>>>> From the logs
> >>> >>>>> java.net.SocketTimeoutException: Read timed out
> >>> >>>>>
> >>> >>>>> And then finally exits the QuorumCnxManager run loop with the
> >>> following
> >>> >>>> message
> >>> >>>>> WARN  [RecvWorker:2:QuorumCnxManager$RecvWorker@780] -
> Connection
> >>> >>>> broken for id 2
> >>> >>>>>
> >>> >>>>> How can we ensure the observer does not go out for service such a
> >>> long
> >>> >>>> duration ?
> >>> >>>>>
> >>> >>>>> Attached the full logs
> >>> >>>>>
> >>> >>>>> Please help
> >>> >>>>> Thanks
> >>> >>>>>
> >>> >>>>> <zookeeper.log.zip>
> >>> >>>>
> >>> >>>>
> >>> >>
> >>> >>
> >>>
> >>>
> >
>
>

Re: Observers Unusable

Posted by Bob Sheehan <bs...@vmware.com>.
To be more specific: after leader election, the old observer connections to
the previous leader are stale (the send-q accumulates), but the TCP timeout
after 15 minutes terminates these old connections and allows a fresh TCP
connection to the new leader.

Bob
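
If it is indeed the TCP retransmission timeout, the ~15 minute figure is in the
same ballpark as the Linux default of net.ipv4.tcp_retries2 = 15, which lets a
connection with unacknowledged data linger for roughly 13-30 minutes before it
is declared dead. One thing worth checking (an assumption, not something
verified in this thread) is lowering that limit on the observer hosts, e.g.:

    # check the current setting (Linux default is 15)
    sysctl net.ipv4.tcp_retries2
    # hypothetical: lower it so dead peers are detected within a few minutes
    sudo sysctl -w net.ipv4.tcp_retries2=8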

On 10/14/15, 3:17 PM, "Bob Sheehan" <bs...@vmware.com> wrote:

>We face a similar issue with remote observers connecting to 3 node
>cluster. Using netstat we see a send-q tcp traffic accumulating. We
>believe this is a tcp timeout issue.
>
>Bob
>
>
>On 10/14/15, 2:54 PM, "elastic search" <el...@gmail.com> wrote:
>
>>When it outputs this message "There is a connection already for server" ,
>>is it really valid connection or using the existing map in memory to
>>figure
>>if the connection already exists.
>>If that is the case it should empty this map and initate a connection
>>when
>>it is LOOKING for a leader after a connection drop.
>>Right now it waits 15 minutes before it does that and re-iniates a
>>connection
>>
>> if (senderWorkerMap.get(sid) == null){
>>   } else {
>>            LOG.debug("There is a connection already for server " + sid);
>>
>>If the ping to the server is not working how is it able to send
>>notifications . The logs says Sending notifications but the ping is not
>>working at that time.
>>
>>I am not sure how to decrease the 15 minute wait period before the
>>Observer
>>re-iniates a connection.
>>Or Alternatively how to make this work
>>
>>
>>On Wed, Oct 14, 2015 at 12:50 PM, Flavio Junqueira <fp...@apache.org>
>>wrote:
>>
>>> The snapshot is really small, so it is unlikely to be it, unless the
>>> connection is really crappy.
>>>
>>> -Flavio
>>>
>>> > On 14 Oct 2015, at 19:38, elastic search <el...@gmail.com>
>>>wrote:
>>> >
>>> > Hi Flavio
>>> >
>>> > Thanks for the reply.
>>> >
>>> > These are the config parameters
>>> > tickTime=2000
>>> > initLimit=10
>>> > syncLimit=5
>>> >
>>> > Data size is
>>> > -rw-r--r-- 1 xoom xoom  85K Oct  3 10:16 snapshot.170001badb
>>> > -rw-r--r-- 1 xoom xoom  65M Oct  3 10:16 log.170000da7f
>>> > -rw-r--r-- 1 xoom xoom  88K Oct 13 23:30 snapshot.170002bfcd
>>> > -rw-r--r-- 1 xoom xoom  65M Oct 13 23:30 log.170001badd
>>> > -rw-r--r-- 1 xoom xoom  65M Oct 14 11:36 log.170002bfcf
>>> >
>>> > These are all defaults and i have never deleted any log files.
>>> > Do you still think this could be due to syncLimit or the snapshot
>>>being
>>> > large
>>> >
>>> >
>>> >
>>> > On Wed, Oct 14, 2015 at 9:44 AM, Flavio Junqueira <fp...@apache.org>
>>> wrote:
>>> >
>>> >>
>>> >>> On 14 Oct 2015, at 17:22, elastic search <el...@gmail.com>
>>> wrote:
>>> >>>
>>> >>> The link between the AWS and the local DataCenter was down for
>>>around 2
>>> >>> minutes.
>>> >>> I have been running ping continuously from the DataCenter to the
>>>AWS
>>> and
>>> >>> that wasn't responding for a few minutes.
>>> >>>
>>> >>> Are you saying since we see Send Notification messages in the logs
>>>,
>>> that
>>> >>> would mean the Observers are able to connect to the ZK .
>>> >>> Only that the ZK server leader is not able to respond back ?
>>> >>
>>> >> Yeah, the logs indicate that the observer is being able to connect
>>> ("There
>>> >> is a connection already for server..." in the logs) but isn't able
>>>to
>>> make
>>> >> progress, which implies that it isn't receiving responses.
>>> >>
>>> >>>
>>> >>> This is what i see from the Server logs
>>> >>> 2015-10-09 16:02:45,780 [myid:3] - ERROR [LearnerHandler-/
>>> >> 10.10.4.46:38161
>>> >>> :LearnerHandler@633] - Unexpected exception causing shutdown while
>>> sock
>>> >>> still open
>>> >>> 2015-10-09 16:19:28,160 [myid:3] - WARN
>>> >>> [RecvWorker:5:QuorumCnxManager$RecvWorker@780] - Connection broken
>>>for
>>> >> id
>>> >>> 5, my id = 3, error =
>>> >>>
>>> >>
>>> >> These lines belong to different parts of the code path. The first
>>> actually
>>> >> indicates that the observer found the leader, but could not keep
>>>going.
>>> For
>>> >> this particular time period, it could be that the observer is trying
>>>to
>>> >> sync up with the leader, but the value of syncLimit is too low and
>>>the
>>> >> leader is abandoning the observer. One reason for that to happen is
>>>the
>>> >> snapshot being large (or your available bandwidth being low).
>>> >>
>>> >> The second line is from the part of the code that does connections
>>>for
>>> >> leader election. That one also says that the connection broke.
>>> >>
>>> >> -Flavio
>>> >>
>>> >>
>>> >>> On Wed, Oct 14, 2015 at 1:28 AM, Flavio Junqueira <fp...@apache.org>
>>> >> wrote:
>>> >>>
>>> >>>> Can you tell why the server wasn't responding to the notifications
>>> from
>>> >>>> the observer? The log file is from the observer and it sounds like
>>>it
>>> is
>>> >>>> being able to send messages out, but it isn't clear why the server
>>> isn't
>>> >>>> responding.
>>> >>>>
>>> >>>> -Flavio
>>> >>>>
>>> >>>>> On 14 Oct 2015, at 01:51, elastic search <el...@gmail.com>
>>> >> wrote:
>>> >>>>>
>>> >>>>>
>>> >>>>> Hello Experts
>>> >>>>>
>>> >>>>> We have 2 Observers running in AWS connecting over to local ZK
>>> Ensemble
>>> >>>> in our own DataCenter.
>>> >>>>>
>>> >>>>> There have been instances where we see network drop for a minute
>>> >> between
>>> >>>> the networks.
>>> >>>>> However the Observers take around 15 minutes to recover even if
>>>the
>>> >>>> network outage is for a minute.
>>> >>>>>
>>> >>>>> From the logs
>>> >>>>> java.net.SocketTimeoutException: Read timed out
>>> >>>>> 2015-10-13 22:26:03,927 [myid:4] - INFO
>>> >>>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>>> >>>> Notification time out: 400
>>> >>>>> 2015-10-13 22:26:04,328 [myid:4] - INFO
>>> >>>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>>> >>>> Notification time out: 800
>>> >>>>> 2015-10-13 22:26:05,129 [myid:4] - INFO
>>> >>>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>>> >>>> Notification time out: 1600
>>> >>>>> 2015-10-13 22:26:06,730 [myid:4] - INFO
>>> >>>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>>> >>>> Notification time out: 3200
>>> >>>>> 2015-10-13 22:26:09,931 [myid:4] - INFO
>>> >>>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>>> >>>> Notification time out: 6400
>>> >>>>> 2015-10-13 22:26:16,332 [myid:4] - INFO
>>> >>>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>>> >>>> Notification time out: 12800
>>> >>>>> 2015-10-13 22:26:29,133 [myid:4] - INFO
>>> >>>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>>> >>>> Notification time out: 25600
>>> >>>>> 2015-10-13 22:26:54,734 [myid:4] - INFO
>>> >>>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>>> >>>> Notification time out: 51200
>>> >>>>> 2015-10-13 22:27:45,935 [myid:4] - INFO
>>> >>>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>>> >>>> Notification time out: 60000
>>> >>>>> 2015-10-13 22:28:45,936 [myid:4] - INFO
>>> >>>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>>> >>>> Notification time out: 60000
>>> >>>>> 2015-10-13 22:29:45,937 [myid:4] - INFO
>>> >>>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>>> >>>> Notification time out: 60000
>>> >>>>> 2015-10-13 22:30:45,938 [myid:4] - INFO
>>> >>>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>>> >>>> Notification time out: 60000
>>> >>>>> 2015-10-13 22:31:45,939 [myid:4] - INFO
>>> >>>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>>> >>>> Notification time out: 60000
>>> >>>>> 2015-10-13 22:32:45,940 [myid:4] - INFO
>>> >>>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>>> >>>> Notification time out: 60000
>>> >>>>> 2015-10-13 22:33:45,941 [myid:4] - INFO
>>> >>>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>>> >>>> Notification time out: 60000
>>> >>>>> 2015-10-13 22:34:45,942 [myid:4] - INFO
>>> >>>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>>> >>>> Notification time out: 60000
>>> >>>>> 2015-10-13 22:35:45,943 [myid:4] - INFO
>>> >>>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>>> >>>> Notification time out: 60000
>>> >>>>> 2015-10-13 22:36:45,944 [myid:4] - INFO
>>> >>>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>>> >>>> Notification time out: 60000
>>> >>>>> 2015-10-13 22:37:45,945 [myid:4] - INFO
>>> >>>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>>> >>>> Notification time out: 60000
>>> >>>>> 2015-10-13 22:38:45,946 [myid:4] - INFO
>>> >>>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>>> >>>> Notification time out: 60000
>>> >>>>> 2015-10-13 22:39:45,947 [myid:4] - INFO
>>> >>>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>>> >>>> Notification time out: 60000
>>> >>>>> 2015-10-13 22:40:45,948 [myid:4] - INFO
>>> >>>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>>> >>>> Notification time out: 60000
>>> >>>>> 2015-10-13 22:41:45,949 [myid:4] - INFO
>>> >>>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>>> >>>> Notification time out: 60000
>>> >>>>>
>>> >>>>> And then finally exits the QuorumCnxManager run loop with the
>>> following
>>> >>>> message
>>> >>>>> WARN  [RecvWorker:2:QuorumCnxManager$RecvWorker@780] - Connection
>>> >>>> broken for id 2
>>> >>>>>
>>> >>>>> How can we ensure the observer does not go out for service such a
>>> long
>>> >>>> duration ?
>>> >>>>>
>>> >>>>> Attached the full logs
>>> >>>>>
>>> >>>>> Please help
>>> >>>>> Thanks
>>> >>>>>
>>> >>>>> <zookeeper.log.zip>
>>> >>>>
>>> >>>>
>>> >>
>>> >>
>>>
>>>
>


Re: Observers Unusable

Posted by Bob Sheehan <bs...@vmware.com>.
We face a similar issue with remote observers connecting to a 3-node
cluster. Using netstat we see send-q TCP traffic accumulating. We
believe this is a TCP timeout issue.

Bob


On 10/14/15, 2:54 PM, "elastic search" <el...@gmail.com> wrote:

>When it outputs this message "There is a connection already for server" ,
>is it really valid connection or using the existing map in memory to
>figure
>if the connection already exists.
>If that is the case it should empty this map and initate a connection when
>it is LOOKING for a leader after a connection drop.
>Right now it waits 15 minutes before it does that and re-iniates a
>connection
>
> if (senderWorkerMap.get(sid) == null){
>   } else {
>            LOG.debug("There is a connection already for server " + sid);
>
>If the ping to the server is not working how is it able to send
>notifications . The logs says Sending notifications but the ping is not
>working at that time.
>
>I am not sure how to decrease the 15 minute wait period before the
>Observer
>re-iniates a connection.
>Or Alternatively how to make this work
>
>
>On Wed, Oct 14, 2015 at 12:50 PM, Flavio Junqueira <fp...@apache.org> wrote:
>
>> The snapshot is really small, so it is unlikely to be it, unless the
>> connection is really crappy.
>>
>> -Flavio
>>
>> > On 14 Oct 2015, at 19:38, elastic search <el...@gmail.com>
>>wrote:
>> >
>> > Hi Flavio
>> >
>> > Thanks for the reply.
>> >
>> > These are the config parameters
>> > tickTime=2000
>> > initLimit=10
>> > syncLimit=5
>> >
>> > Data size is
>> > -rw-r--r-- 1 xoom xoom  85K Oct  3 10:16 snapshot.170001badb
>> > -rw-r--r-- 1 xoom xoom  65M Oct  3 10:16 log.170000da7f
>> > -rw-r--r-- 1 xoom xoom  88K Oct 13 23:30 snapshot.170002bfcd
>> > -rw-r--r-- 1 xoom xoom  65M Oct 13 23:30 log.170001badd
>> > -rw-r--r-- 1 xoom xoom  65M Oct 14 11:36 log.170002bfcf
>> >
>> > These are all defaults and i have never deleted any log files.
>> > Do you still think this could be due to syncLimit or the snapshot
>>being
>> > large
>> >
>> >
>> >
>> > On Wed, Oct 14, 2015 at 9:44 AM, Flavio Junqueira <fp...@apache.org>
>> wrote:
>> >
>> >>
>> >>> On 14 Oct 2015, at 17:22, elastic search <el...@gmail.com>
>> wrote:
>> >>>
>> >>> The link between the AWS and the local DataCenter was down for
>>around 2
>> >>> minutes.
>> >>> I have been running ping continuously from the DataCenter to the AWS
>> and
>> >>> that wasn't responding for a few minutes.
>> >>>
>> >>> Are you saying since we see Send Notification messages in the logs ,
>> that
>> >>> would mean the Observers are able to connect to the ZK .
>> >>> Only that the ZK server leader is not able to respond back ?
>> >>
>> >> Yeah, the logs indicate that the observer is being able to connect
>> ("There
>> >> is a connection already for server..." in the logs) but isn't able to
>> make
>> >> progress, which implies that it isn't receiving responses.
>> >>
>> >>>
>> >>> This is what i see from the Server logs
>> >>> 2015-10-09 16:02:45,780 [myid:3] - ERROR [LearnerHandler-/
>> >> 10.10.4.46:38161
>> >>> :LearnerHandler@633] - Unexpected exception causing shutdown while
>> sock
>> >>> still open
>> >>> 2015-10-09 16:19:28,160 [myid:3] - WARN
>> >>> [RecvWorker:5:QuorumCnxManager$RecvWorker@780] - Connection broken
>>for
>> >> id
>> >>> 5, my id = 3, error =
>> >>>
>> >>
>> >> These lines belong to different parts of the code path. The first
>> actually
>> >> indicates that the observer found the leader, but could not keep
>>going.
>> For
>> >> this particular time period, it could be that the observer is trying
>>to
>> >> sync up with the leader, but the value of syncLimit is too low and
>>the
>> >> leader is abandoning the observer. One reason for that to happen is
>>the
>> >> snapshot being large (or your available bandwidth being low).
>> >>
>> >> The second line is from the part of the code that does connections
>>for
>> >> leader election. That one also says that the connection broke.
>> >>
>> >> -Flavio
>> >>
>> >>
>> >>> On Wed, Oct 14, 2015 at 1:28 AM, Flavio Junqueira <fp...@apache.org>
>> >> wrote:
>> >>>
>> >>>> Can you tell why the server wasn't responding to the notifications
>> from
>> >>>> the observer? The log file is from the observer and it sounds like
>>it
>> is
>> >>>> being able to send messages out, but it isn't clear why the server
>> isn't
>> >>>> responding.
>> >>>>
>> >>>> -Flavio
>> >>>>
>> >>>>> On 14 Oct 2015, at 01:51, elastic search <el...@gmail.com>
>> >> wrote:
>> >>>>>
>> >>>>>
>> >>>>> Hello Experts
>> >>>>>
>> >>>>> We have 2 Observers running in AWS connecting over to local ZK
>> Ensemble
>> >>>> in our own DataCenter.
>> >>>>>
>> >>>>> There have been instances where we see network drop for a minute
>> >> between
>> >>>> the networks.
>> >>>>> However the Observers take around 15 minutes to recover even if
>>the
>> >>>> network outage is for a minute.
>> >>>>>
>> >>>>> From the logs
>> >>>>> java.net.SocketTimeoutException: Read timed out
>> >>>>>
>> >>>>> And then finally exits the QuorumCnxManager run loop with the
>> following
>> >>>> message
>> >>>>> WARN  [RecvWorker:2:QuorumCnxManager$RecvWorker@780] - Connection
>> >>>> broken for id 2
>> >>>>>
>> >>>>> How can we ensure the observer does not go out for service such a
>> long
>> >>>> duration ?
>> >>>>>
>> >>>>> Attached the full logs
>> >>>>>
>> >>>>> Please help
>> >>>>> Thanks
>> >>>>>
>> >>>>> <zookeeper.log.zip>
>> >>>>
>> >>>>
>> >>
>> >>
>>
>>


Re: Observers Unusable

Posted by elastic search <el...@gmail.com>.
When it outputs the message "There is a connection already for server",
is it really a valid connection, or is it just using the existing in-memory map
to figure out whether a connection already exists?
If that is the case, it should empty this map and initiate a connection when
it is LOOKING for a leader after a connection drop.
Right now it waits 15 minutes before it does that and re-initiates a
connection.

 if (senderWorkerMap.get(sid) == null) {
     // ... open a new connection to server sid ...
 } else {
     LOG.debug("There is a connection already for server " + sid);
 }
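
To make the pattern concrete, here is a minimal sketch (purely illustrative,
not the actual QuorumCnxManager code) of why a stale map entry blocks
reconnection until the broken socket is finally detected:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical illustration of the pattern discussed above.
    class ConnectionMapSketch {
        private final Map<Long, Object> senderWorkerMap = new ConcurrentHashMap<>();

        void connectOne(long sid) {
            if (senderWorkerMap.get(sid) == null) {
                // no known connection: open a socket and register a sender worker
                senderWorkerMap.put(sid, new Object() /* stands in for a SendWorker */);
            } else {
                // a (possibly stale) entry exists, so no new connection is attempted
                System.out.println("There is a connection already for server " + sid);
            }
        }

        void connectionBroken(long sid) {
            // only when the broken socket is finally detected is the entry removed,
            // which is what allows a fresh connection attempt
            senderWorkerMap.remove(sid);
        }
    }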

If the ping to the server is not working, how is it able to send
notifications? The logs say "Sending notifications", but the ping is not
working at that time.

I am not sure how to decrease the 15 minute wait period before the Observer
re-initiates a connection,
or alternatively how to make this work.


On Wed, Oct 14, 2015 at 12:50 PM, Flavio Junqueira <fp...@apache.org> wrote:

> The snapshot is really small, so it is unlikely to be it, unless the
> connection is really crappy.
>
> -Flavio
>
> > On 14 Oct 2015, at 19:38, elastic search <el...@gmail.com> wrote:
> >
> > Hi Flavio
> >
> > Thanks for the reply.
> >
> > These are the config parameters
> > tickTime=2000
> > initLimit=10
> > syncLimit=5
> >
> > Data size is
> > -rw-r--r-- 1 xoom xoom  85K Oct  3 10:16 snapshot.170001badb
> > -rw-r--r-- 1 xoom xoom  65M Oct  3 10:16 log.170000da7f
> > -rw-r--r-- 1 xoom xoom  88K Oct 13 23:30 snapshot.170002bfcd
> > -rw-r--r-- 1 xoom xoom  65M Oct 13 23:30 log.170001badd
> > -rw-r--r-- 1 xoom xoom  65M Oct 14 11:36 log.170002bfcf
> >
> > These are all defaults and i have never deleted any log files.
> > Do you still think this could be due to syncLimit or the snapshot being
> > large
> >
> >
> >
> > On Wed, Oct 14, 2015 at 9:44 AM, Flavio Junqueira <fp...@apache.org>
> wrote:
> >
> >>
> >>> On 14 Oct 2015, at 17:22, elastic search <el...@gmail.com>
> wrote:
> >>>
> >>> The link between the AWS and the local DataCenter was down for around 2
> >>> minutes.
> >>> I have been running ping continuously from the DataCenter to the AWS
> and
> >>> that wasn't responding for a few minutes.
> >>>
> >>> Are you saying since we see Send Notification messages in the logs ,
> that
> >>> would mean the Observers are able to connect to the ZK .
> >>> Only that the ZK server leader is not able to respond back ?
> >>
> >> Yeah, the logs indicate that the observer is being able to connect
> ("There
> >> is a connection already for server..." in the logs) but isn't able to
> make
> >> progress, which implies that it isn't receiving responses.
> >>
> >>>
> >>> This is what i see from the Server logs
> >>> 2015-10-09 16:02:45,780 [myid:3] - ERROR [LearnerHandler-/
> >> 10.10.4.46:38161
> >>> :LearnerHandler@633] - Unexpected exception causing shutdown while
> sock
> >>> still open
> >>> 2015-10-09 16:19:28,160 [myid:3] - WARN
> >>> [RecvWorker:5:QuorumCnxManager$RecvWorker@780] - Connection broken for
> >> id
> >>> 5, my id = 3, error =
> >>>
> >>
> >> These lines belong to different parts of the code path. The first
> actually
> >> indicates that the observer found the leader, but could not keep going.
> For
> >> this particular time period, it could be that the observer is trying to
> >> sync up with the leader, but the value of syncLimit is too low and the
> >> leader is abandoning the observer. One reason for that to happen is the
> >> snapshot being large (or your available bandwidth being low).
> >>
> >> The second line is from the part of the code that does connections for
> >> leader election. That one also says that the connection broke.
> >>
> >> -Flavio
> >>
> >>
> >>> On Wed, Oct 14, 2015 at 1:28 AM, Flavio Junqueira <fp...@apache.org>
> >> wrote:
> >>>
> >>>> Can you tell why the server wasn't responding to the notifications
> from
> >>>> the observer? The log file is from the observer and it sounds like it
> is
> >>>> being able to send messages out, but it isn't clear why the server
> isn't
> >>>> responding.
> >>>>
> >>>> -Flavio
> >>>>
> >>>>> On 14 Oct 2015, at 01:51, elastic search <el...@gmail.com>
> >> wrote:
> >>>>>
> >>>>>
> >>>>> Hello Experts
> >>>>>
> >>>>> We have 2 Observers running in AWS connecting over to local ZK
> Ensemble
> >>>> in our own DataCenter.
> >>>>>
> >>>>> There have been instances where we see network drop for a minute
> >> between
> >>>> the networks.
> >>>>> However the Observers take around 15 minutes to recover even if the
> >>>> network outage is for a minute.
> >>>>>
> >>>>> From the logs
> >>>>> java.net.SocketTimeoutException: Read timed out
> >>>>>
> >>>>> And then finally exits the QuorumCnxManager run loop with the
> following
> >>>> message
> >>>>> WARN  [RecvWorker:2:QuorumCnxManager$RecvWorker@780] - Connection
> >>>> broken for id 2
> >>>>>
> >>>>> How can we ensure the observer does not go out for service such a
> long
> >>>> duration ?
> >>>>>
> >>>>> Attached the full logs
> >>>>>
> >>>>> Please help
> >>>>> Thanks
> >>>>>
> >>>>> <zookeeper.log.zip>
> >>>>
> >>>>
> >>
> >>
>
>

Re: Observers Unusable

Posted by Flavio Junqueira <fp...@apache.org>.
The snapshot is really small, so it is unlikely to be it, unless the connection is really crappy.

-Flavio
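
Rough arithmetic backing that up (assuming roughly a 1 Mbit/s usable link,
which is not stated in the thread):

    88 KB ≈ 0.7 Mbit
    0.7 Mbit / 1 Mbit/s ≈ 0.7 s, well inside syncLimit * tickTime = 5 * 2000 ms = 10 s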

> On 14 Oct 2015, at 19:38, elastic search <el...@gmail.com> wrote:
> 
> Hi Flavio
> 
> Thanks for the reply.
> 
> These are the config parameters
> tickTime=2000
> initLimit=10
> syncLimit=5
> 
> Data size is
> -rw-r--r-- 1 xoom xoom  85K Oct  3 10:16 snapshot.170001badb
> -rw-r--r-- 1 xoom xoom  65M Oct  3 10:16 log.170000da7f
> -rw-r--r-- 1 xoom xoom  88K Oct 13 23:30 snapshot.170002bfcd
> -rw-r--r-- 1 xoom xoom  65M Oct 13 23:30 log.170001badd
> -rw-r--r-- 1 xoom xoom  65M Oct 14 11:36 log.170002bfcf
> 
> These are all defaults and i have never deleted any log files.
> Do you still think this could be due to syncLimit or the snapshot being
> large
> 
> 
> 
> On Wed, Oct 14, 2015 at 9:44 AM, Flavio Junqueira <fp...@apache.org> wrote:
> 
>> 
>>> On 14 Oct 2015, at 17:22, elastic search <el...@gmail.com> wrote:
>>> 
>>> The link between the AWS and the local DataCenter was down for around 2
>>> minutes.
>>> I have been running ping continuously from the DataCenter to the AWS and
>>> that wasn't responding for a few minutes.
>>> 
>>> Are you saying since we see Send Notification messages in the logs , that
>>> would mean the Observers are able to connect to the ZK .
>>> Only that the ZK server leader is not able to respond back ?
>> 
>> Yeah, the logs indicate that the observer is being able to connect ("There
>> is a connection already for server..." in the logs) but isn't able to make
>> progress, which implies that it isn't receiving responses.
>> 
>>> 
>>> This is what i see from the Server logs
>>> 2015-10-09 16:02:45,780 [myid:3] - ERROR [LearnerHandler-/
>> 10.10.4.46:38161
>>> :LearnerHandler@633] - Unexpected exception causing shutdown while sock
>>> still open
>>> 2015-10-09 16:19:28,160 [myid:3] - WARN
>>> [RecvWorker:5:QuorumCnxManager$RecvWorker@780] - Connection broken for
>> id
>>> 5, my id = 3, error =
>>> 
>> 
>> These lines belong to different parts of the code path. The first actually
>> indicates that the observer found the leader, but could not keep going. For
>> this particular time period, it could be that the observer is trying to
>> sync up with the leader, but the value of syncLimit is too low and the
>> leader is abandoning the observer. One reason for that to happen is the
>> snapshot being large (or your available bandwidth being low).
>> 
>> The second line is from the part of the code that does connections for
>> leader election. That one also says that the connection broke.
>> 
>> -Flavio
>> 
>> 
>>> On Wed, Oct 14, 2015 at 1:28 AM, Flavio Junqueira <fp...@apache.org>
>> wrote:
>>> 
>>>> Can you tell why the server wasn't responding to the notifications from
>>>> the observer? The log file is from the observer and it sounds like it is
>>>> being able to send messages out, but it isn't clear why the server isn't
>>>> responding.
>>>> 
>>>> -Flavio
>>>> 
>>>>> On 14 Oct 2015, at 01:51, elastic search <el...@gmail.com>
>> wrote:
>>>>> 
>>>>> 
>>>>> Hello Experts
>>>>> 
>>>>> We have 2 Observers running in AWS connecting over to local ZK Ensemble
>>>> in our own DataCenter.
>>>>> 
>>>>> There have been instances where we see network drop for a minute
>> between
>>>> the networks.
>>>>> However the Observers take around 15 minutes to recover even if the
>>>> network outage is for a minute.
>>>>> 
>>>>> From the logs
>>>>> java.net.SocketTimeoutException: Read timed out
>>>>> 
>>>>> And then finally exits the QuorumCnxManager run loop with the following
>>>> message
>>>>> WARN  [RecvWorker:2:QuorumCnxManager$RecvWorker@780] - Connection
>>>> broken for id 2
>>>>> 
>>>>> How can we ensure the observer does not go out for service such a long
>>>> duration ?
>>>>> 
>>>>> Attached the full logs
>>>>> 
>>>>> Please help
>>>>> Thanks
>>>>> 
>>>>> <zookeeper.log.zip>
>>>> 
>>>> 
>> 
>> 


Re: Observers Unusable

Posted by elastic search <el...@gmail.com>.
Hi Flavio

Thanks for the reply.

These are the config parameters
tickTime=2000
initLimit=10
syncLimit=5

Data size is
-rw-r--r-- 1 xoom xoom  85K Oct  3 10:16 snapshot.170001badb
-rw-r--r-- 1 xoom xoom  65M Oct  3 10:16 log.170000da7f
-rw-r--r-- 1 xoom xoom  88K Oct 13 23:30 snapshot.170002bfcd
-rw-r--r-- 1 xoom xoom  65M Oct 13 23:30 log.170001badd
-rw-r--r-- 1 xoom xoom  65M Oct 14 11:36 log.170002bfcf

These are all defaults, and I have never deleted any log files.
Do you still think this could be due to syncLimit or the snapshot being
large?
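
For what it is worth, initLimit and syncLimit are measured in ticks, so with
the settings above the budgets are initLimit * tickTime = 10 * 2000 ms = 20 s
for a learner to connect and sync, and syncLimit * tickTime = 5 * 2000 ms = 10 s
of allowed lag before the leader gives up on it. If the WAN link is the problem,
a more generous configuration might look like the following (illustrative
values only, not a recommendation from this thread):

    # zoo.cfg sketch for a high-latency link (illustrative values only)
    tickTime=2000
    # 30 ticks * 2000 ms = 60 s for a learner to connect and sync
    initLimit=30
    # 15 ticks * 2000 ms = 30 s of allowed lag before the leader gives up
    syncLimit=15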



On Wed, Oct 14, 2015 at 9:44 AM, Flavio Junqueira <fp...@apache.org> wrote:

>
> > On 14 Oct 2015, at 17:22, elastic search <el...@gmail.com> wrote:
> >
> > The link between the AWS and the local DataCenter was down for around 2
> > minutes.
> > I have been running ping continuously from the DataCenter to the AWS and
> > that wasn't responding for a few minutes.
> >
> > Are you saying since we see Send Notification messages in the logs , that
> > would mean the Observers are able to connect to the ZK .
> > Only that the ZK server leader is not able to respond back ?
>
> Yeah, the logs indicate that the observer is being able to connect ("There
> is a connection already for server..." in the logs) but isn't able to make
> progress, which implies that it isn't receiving responses.
>
> >
> > This is what i see from the Server logs
> > 2015-10-09 16:02:45,780 [myid:3] - ERROR [LearnerHandler-/
> 10.10.4.46:38161
> > :LearnerHandler@633] - Unexpected exception causing shutdown while sock
> > still open
> > 2015-10-09 16:19:28,160 [myid:3] - WARN
> > [RecvWorker:5:QuorumCnxManager$RecvWorker@780] - Connection broken for
> id
> > 5, my id = 3, error =
> >
>
> These lines belong to different parts of the code path. The first actually
> indicates that the observer found the leader, but could not keep going. For
> this particular time period, it could be that the observer is trying to
> sync up with the leader, but the value of syncLimit is too low and the
> leader is abandoning the observer. One reason for that to happen is the
> snapshot being large (or your available bandwidth being low).
>
> The second line is from the part of the code that does connections for
> leader election. That one also says that the connection broke.
>
> -Flavio
>
>
> > On Wed, Oct 14, 2015 at 1:28 AM, Flavio Junqueira <fp...@apache.org>
> wrote:
> >
> >> Can you tell why the server wasn't responding to the notifications from
> >> the observer? The log file is from the observer and it sounds like it is
> >> being able to send messages out, but it isn't clear why the server isn't
> >> responding.
> >>
> >> -Flavio
> >>
> >>> On 14 Oct 2015, at 01:51, elastic search <el...@gmail.com>
> wrote:
> >>>
> >>>
> >>> Hello Experts
> >>>
> >>> We have 2 Observers running in AWS connecting over to local ZK Ensemble
> >> in our own DataCenter.
> >>>
> >>> There have been instances where we see network drop for a minute
> between
> >> the networks.
> >>> However the Observers take around 15 minutes to recover even if the
> >> network outage is for a minute.
> >>>
> >>> From the logs
> >>> java.net.SocketTimeoutException: Read timed out
> >>>
> >>> And then finally exits the QuorumCnxManager run loop with the following
> >> message
> >>> WARN  [RecvWorker:2:QuorumCnxManager$RecvWorker@780] - Connection
> >> broken for id 2
> >>>
> >>> How can we ensure the observer does not go out for service such a long
> >> duration ?
> >>>
> >>> Attached the full logs
> >>>
> >>> Please help
> >>> Thanks
> >>>
> >>> <zookeeper.log.zip>
> >>
> >>
>
>

Re: Observers Unusable

Posted by Flavio Junqueira <fp...@apache.org>.
> On 14 Oct 2015, at 17:22, elastic search <el...@gmail.com> wrote:
> 
> The link between AWS and the local DataCenter was down for around 2
> minutes.
> I have been running ping continuously from the DataCenter to AWS, and it
> was not getting responses for a few minutes.
> 
> Are you saying that, since we see Send Notification messages in the logs,
> the Observers are able to connect to ZK, and it is only the ZK leader that
> is not responding back?

Yeah, the logs indicate that the observer is able to connect ("There is a connection already for server..." in the logs) but isn't able to make progress, which implies that it isn't receiving responses.
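
As a quick sanity check, you could also query the ensemble members directly from the observer host with the four-letter-word commands; if these hang or return nothing while ping works, the problem is above the raw network layer. The host name below is just a placeholder for your own ensemble members:

echo ruok | nc zk1.dc.example.com 2181
echo stat | nc zk1.dc.example.com 2181

A healthy server answers "imok" to the first and prints its mode (leader/follower/observer) and connection stats for the second.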

> 
> This is what I see in the server logs:
> 2015-10-09 16:02:45,780 [myid:3] - ERROR [LearnerHandler-/10.10.4.46:38161
> :LearnerHandler@633] - Unexpected exception causing shutdown while sock
> still open
> 2015-10-09 16:19:28,160 [myid:3] - WARN
> [RecvWorker:5:QuorumCnxManager$RecvWorker@780] - Connection broken for id
> 5, my id = 3, error =
> 

These lines belong to different parts of the code path. The first actually indicates that the observer found the leader, but could not keep going. For this particular time period, it could be that the observer is trying to sync up with the leader, but the value of syncLimit is too low and the leader is abandoning the observer. One reason for that to happen is the snapshot being large (or your available bandwidth being low).
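
For illustration only (the numbers are an example, not a recommendation), with a zoo.cfg like this the leader gives a learner syncLimit * tickTime = 5 * 2000 ms = 10 s to stay in sync, and initLimit * tickTime = 20 s to connect and complete the initial sync:

tickTime=2000
initLimit=10
syncLimit=5

If shipping a large snapshot across the WAN takes longer than that, the leader drops the learner, so raising syncLimit (and possibly initLimit) on the ensemble is one way to give the observer more time to catch up over a slower link.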

The second line is from the part of the code that does connections for leader election. That one also says that the connection broke.  

-Flavio


> On Wed, Oct 14, 2015 at 1:28 AM, Flavio Junqueira <fp...@apache.org> wrote:
> 
>> Can you tell why the server wasn't responding to the notifications from
>> the observer? The log file is from the observer and it sounds like it is
>> able to send messages out, but it isn't clear why the server isn't
>> responding.
>> 
>> -Flavio
>> 
>>> On 14 Oct 2015, at 01:51, elastic search <el...@gmail.com> wrote:
>>> 
>>> 
>>> Hello Experts
>>> 
>>> We have 2 Observers running in AWS connecting over to local ZK Ensemble
>> in our own DataCenter.
>>> 
>>> There have been instances where we see network drop for a minute between
>> the networks.
>>> However the Observers take around 15 minutes to recover even if the
>> network outage is for a minute.
>>> 
>>> From the logs
>>> java.net.SocketTimeoutException: Read timed out
>>> 2015-10-13 22:26:03,927 [myid:4] - INFO
>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>> Notification time out: 400
>>> 
>>> And then finally exits the QuorumCnxManager run loop with the following
>> message
>>> WARN  [RecvWorker:2:QuorumCnxManager$RecvWorker@780] - Connection
>> broken for id 2
>>> 
>>> How can we ensure the observer is not out of service for such a long
>> duration?
>>> 
>>> Attached the full logs
>>> 
>>> Please help
>>> Thanks
>>> 
>>> <zookeeper.log.zip>
>> 
>> 


Re: Observers Unusable

Posted by elastic search <el...@gmail.com>.
The link between AWS and the local DataCenter was down for around 2
minutes.
I have been running ping continuously from the DataCenter to AWS, and it
was not getting responses for a few minutes.

Are you saying that, since we see Send Notification messages in the logs,
the Observers are able to connect to ZK, and it is only the ZK leader that
is not responding back?

This is what I see in the server logs:
2015-10-09 16:02:45,780 [myid:3] - ERROR [LearnerHandler-/10.10.4.46:38161
:LearnerHandler@633] - Unexpected exception causing shutdown while sock
still open
2015-10-09 16:19:28,160 [myid:3] - WARN
[RecvWorker:5:QuorumCnxManager$RecvWorker@780] - Connection broken for id
5, my id = 3, error =

On Wed, Oct 14, 2015 at 1:28 AM, Flavio Junqueira <fp...@apache.org> wrote:

> Can you tell why the server wasn't responding to the notifications from
> the observer? The log file is from the observer and it sounds like it is
> able to send messages out, but it isn't clear why the server isn't
> responding.
>
> -Flavio
>
> > On 14 Oct 2015, at 01:51, elastic search <el...@gmail.com> wrote:
> >
> >
> > Hello Experts
> >
> > We have 2 Observers running in AWS connecting over to local ZK Ensemble
> in our own DataCenter.
> >
> > There have been instances where we see network drop for a minute between
> the networks.
> > However the Observers take around 15 minutes to recover even if the
> network outage is for a minute.
> >
> > From the logs
> > java.net.SocketTimeoutException: Read timed out
> > 2015-10-13 22:26:03,927 [myid:4] - INFO
> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
> Notification time out: 400
> >
> > And then finally exits the QuorumCnxManager run loop with the following
> message
> > WARN  [RecvWorker:2:QuorumCnxManager$RecvWorker@780] - Connection
> broken for id 2
> >
> > How can we ensure the observer is not out of service for such a long
> duration?
> >
> > Attached the full logs
> >
> > Please help
> > Thanks
> >
> > <zookeeper.log.zip>
>
>

Re: Observers Unusable

Posted by Flavio Junqueira <fp...@apache.org>.
Can you tell why the server wasn't responding to the notifications from the observer? The log file is from the observer and it sounds like it is able to send messages out, but it isn't clear why the server isn't responding.

-Flavio

> On 14 Oct 2015, at 01:51, elastic search <el...@gmail.com> wrote:
> 
> 
> Hello Experts
> 
> We have 2 Observers running in AWS connecting over to local ZK Ensemble in our own DataCenter.
> 
> There have been instances where we see network drop for a minute between the networks.
> However the Observers take around 15 minutes to recover even if the network outage is for a minute.
> 
> From the logs
> java.net.SocketTimeoutException: Read timed out
> 2015-10-13 22:26:03,927 [myid:4] - INFO  [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 400
> 2015-10-13 22:26:04,328 [myid:4] - INFO  [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 800
> 
> And then finally exits the QuorumCnxManager run loop with the following message
> WARN  [RecvWorker:2:QuorumCnxManager$RecvWorker@780] - Connection broken for id 2
> 
> How can we ensure the observer is not out of service for such a long duration?
> 
> Attached the full logs 
> 
> Please help
> Thanks
> 
> <zookeeper.log.zip>