Posted to user@zookeeper.apache.org by s influxdb <el...@gmail.com> on 2016/04/07 06:41:18 UTC

node 2 not rejoining cluster

One of our nodes hit an OOM (java.lang.OutOfMemoryError: unable to
create new native thread) and then became unresponsive.

We tried to add the node back to the cluster, but with no luck.

It doesn't seem to receive any notification messages from the other
nodes; it just keeps sending notifications in a loop.

Please see the attached logs from the node that is out of rotation.

Any input is appreciated.

Thanks

Re: node 2 not rejoining cluster

Posted by s influxdb <el...@gmail.com>.

Is this issue related to the recent email thread
"Working around Leader election Listner thread death"?

https://issues.apache.org/jira/browse/ZOOKEEPER-2186

On Thu, Apr 14, 2016 at 2:54 AM, Flavio Junqueira <fp...@apache.org> wrote:

> Other than some kind of funky packet filtering rule, I'm not sure why
> you'd not be receiving the ACKs.
>
> I think that reconfiguring isn't the right way of addressing the problem.
> If you have some underlying issue, configuration or even bad hardware, then
> adding more nodes will not fix it. Even worse, it might lurk there for
> some time and come back to bite you later.
>
> If you do lose a machine (e.g., permanent failure, decommission), then it
> does make sense to reconfigure the ensemble.
>
> -Flavio

Re: node 2 not rejoining cluster

Posted by Flavio Junqueira <fp...@apache.org>.

Other than some kind of funky packet filtering rule, I'm not sure why you'd not be receiving the ACKs.

I think that reconfiguring isn't the right way of addressing the problem. If you have some underlying issue, such as a configuration problem or even bad hardware, then adding more nodes will not fix it. Even worse, it might lurk there for some time and come back to bite you later.

If you do lose a machine (e.g., permanent failure, decommission), then it does make sense to reconfigure the ensemble.
 
-Flavio

  
> On 14 Apr 2016, at 01:12, s influxdb <el...@gmail.com> wrote:
> 
> Thanks Flavio. 
> 
> Would you know why node 2 could not receive ACKs from the other 2 nodes?
>
> What is the workaround in scenarios like this, where 1 node in a 3-node cluster is not responding?
> ** If we do a rolling restart, there is a possibility of downtime.
> ** Add 2 more nodes to the configs and do a rolling restart.
> ** Can you think of any way to fix node 2 so that it rejoins the cluster?
> 
> Would appreciate your reply.

Re: node 2 not rejoining cluster

Posted by s influxdb <el...@gmail.com>.

Thanks Flavio.

Would you know why node 2 could not receive ACKs from the other 2 nodes?

What is the workaround in scenarios like this, where 1 node in a 3-node
cluster is not responding?
** If we do a rolling restart, there is a possibility of downtime.
** Add 2 more nodes to the configs and do a rolling restart.
** Can you think of any way to fix node 2 so that it rejoins the cluster?

Would appreciate your reply.



On Tue, Apr 12, 2016 at 1:33 AM, Flavio Junqueira <fp...@apache.org> wrote:

> Good to hear you've been able to sort it out.
>
> -Flavio

Re: node 2 not rejoining cluster

Posted by Flavio Junqueira <fp...@apache.org>.

Good to hear you've been able to sort it out.

-Flavio

> On 12 Apr 2016, at 03:02, s influxdb <el...@gmail.com> wrote:
> 
> I created a parallel, independent zookeeper cluster on the same set of
> machines with different ports, and that worked. This indicates the port was
> the issue.

Re: node 2 not rejoining cluster

Posted by s influxdb <el...@gmail.com>.

I created a parallel, independent zookeeper cluster on the same set of
machines with different ports, and that worked. This indicates the port was
the issue.
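
For reference, a ZooKeeper ensemble names its peers in zoo.cfg with
server.N=host:peerPort:electionPort entries. The sketch below only illustrates
how such entries could be generated for a second ensemble on alternative
ports. The hostnames and the ports 2889/3889 are hypothetical; the thread does
not say which hosts or ports were actually used.

```python
from typing import List


def server_lines(hosts: List[str], peer_port: int, election_port: int) -> List[str]:
    """Build zoo.cfg 'server.N=host:peerPort:electionPort' entries, N starting at 1."""
    return [
        f"server.{i}={host}:{peer_port}:{election_port}"
        for i, host in enumerate(hosts, start=1)
    ]


# Hypothetical hosts and ports, chosen only for illustration.
example_hosts = ["zk1.example.com", "zk2.example.com", "zk3.example.com"]
for line in server_lines(example_hosts, peer_port=2889, election_port=3889):
    print(line)
```

A second instance on the same machines would also need its own clientPort and
its own dataDir (with a matching myid file) so it does not collide with the
first ensemble.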

On Mon, Apr 11, 2016 at 1:35 PM, s influxdb <el...@gmail.com> wrote:

> A reboot of the server didn't help.

Re: node 2 not rejoining cluster

Posted by s influxdb <el...@gmail.com>.

A reboot of the server didn't help.

On Thu, Apr 7, 2016 at 6:50 PM, s influxdb <el...@gmail.com> wrote:

> I ran tcpdump on all three nodes.
> It looks like, for every [PSH, ACK], the [ACK] from the other nodes back
> to this 2nd node on port 3888 is missing.

Re: node 2 not rejoining cluster

Posted by s influxdb <el...@gmail.com>.

I ran tcpdump on all three nodes.
It looks like, for every [PSH, ACK], the [ACK] from the other nodes back
to this 2nd node on port 3888 is missing.
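
The kind of check described above can be sketched in code: for each segment
that carries data, look for a later segment in the opposite direction whose
acknowledgement number covers it. This is a minimal sketch, not a capture
parser. The Segment record and its fields are hypothetical and would have to
be filled in from the capture by hand or by another tool, and sequence-number
wraparound is ignored.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    src: str     # sender, e.g. "node2:40000" (hypothetical naming)
    dst: str     # receiver, e.g. "node1:3888"
    seq: int     # sequence number of the first payload byte
    ack: int     # acknowledgement number carried by this segment
    length: int  # payload length in bytes (0 for a pure ACK)


def unacked_segments(segments: List[Segment]) -> List[Segment]:
    """Return data segments that no later reverse-direction segment acknowledges."""
    missing = []
    for i, seg in enumerate(segments):
        if seg.length == 0:
            continue  # pure ACKs carry no data that needs acknowledging
        expected = seg.seq + seg.length  # ack number that would cover this data
        acked = any(
            later.src == seg.dst and later.dst == seg.src and later.ack >= expected
            for later in segments[i + 1:]
        )
        if not acked:
            missing.append(seg)
    return missing
```

Run against the segments seen on each node, a non-empty result for traffic to
port 3888 would match the symptom reported here.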


On Thu, Apr 7, 2016 at 1:29 PM, s influxdb <el...@gmail.com> wrote:

> Thanks Flavio for your quick replies.
> The zookeeper version is 3.4.6

Re: node 2 not rejoining cluster

Posted by s influxdb <el...@gmail.com>.

Thanks Flavio for your quick replies.
The zookeeper version is 3.4.6



On Thu, Apr 7, 2016 at 1:23 PM, Flavio P JUNQUEIRA <fp...@apache.org> wrote:

> You need to determine why it is not receiving notification messages. From
> the information you've given, it doesn't look like a zookeeper code issue.
>
> BTW, which version are you using?
>
> -Flavio

Re: node 2 not rejoining cluster

Posted by Flavio P JUNQUEIRA <fp...@apache.org>.

You need to determine why it is not receiving notification messages. From
the information you've given, it doesn't look like a zookeeper code issue.

BTW, which version are you using?

-Flavio
On 7 Apr 2016 21:20, "s influxdb" <el...@gmail.com> wrote:

> Nothing on the iptables firewall.
>
> What options do I have to reconnect this node to the cluster?

Re: node 2 not rejoining cluster

Posted by s influxdb <el...@gmail.com>.

Nothing on the iptables firewall.

What options do I have to reconnect this node to the cluster?


On Thu, Apr 7, 2016 at 10:14 AM, s influxdb <el...@gmail.com> wrote:

> telnet works on 2888 and 3888 to the other nodes. Now I see
> java.net.SocketTimeoutException: connect timed out messages in the logs for
> node 2.

Re: node 2 not rejoining cluster

Posted by s influxdb <el...@gmail.com>.

telnet works on 2888 and 3888 to the other nodes. Now I see
java.net.SocketTimeoutException: connect timed out messages in the logs for
node 2.
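
The telnet check can be repeated with an explicit timeout from each node. A
minimal sketch, assuming the ports 2888 and 3888 mentioned above; any
hostnames passed in are placeholders, since the thread does not name the
hosts. A successful connect only shows that the TCP handshake completes. It
does not show that election messages flow once the connection is open.

```python
import socket
from typing import Dict, Iterable, Tuple


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        # Refused connections, timeouts, and DNS failures are all OSError subclasses.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_peers(hosts: Iterable[str],
                ports: Iterable[int] = (2888, 3888),
                timeout: float = 3.0) -> Dict[Tuple[str, int], bool]:
    """Try every host/port pair and map (host, port) to the connect result."""
    port_list = tuple(ports)  # materialize so the ports can be reused per host
    return {
        (host, port): can_connect(host, port, timeout)
        for host in hosts
        for port in port_list
    }
```

Calling check_peers(["zk1.example.com", "zk3.example.com"]) from node 2, and
the equivalent from the other two nodes toward node 2, would show whether
connects fail in one direction only.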

On Thu, Apr 7, 2016 at 3:05 AM, Flavio Junqueira <fp...@apache.org> wrote:

> I only see notifications from the node to itself. It says that it is
> connected to 1, but it doesn't seem to be receiving the notification from
> 1. It also doesn't seem to be receiving the connection request from 3.
>
> The last time I saw something like this, it was due to iptables rules, but
> if it was working before and no configuration has changed, then I don't know
> what it could be.
>
> -Flavio

Re: node 2 not rejoining cluster

Posted by Flavio Junqueira <fp...@apache.org>.

I only see notifications from the node to itself. It says that it is connected to 1, but it doesn't seem to be receiving the notification from 1. It also doesn't seem to be receiving the connection request from 3.

The last time I saw something like this, it was due to iptables rules, but if it was working before and no configuration has changed, then I don't know what it could be.

-Flavio

> On 07 Apr 2016, at 05:43, s influxdb <el...@gmail.com> wrote:
> 
> this is the pastie
> http://pastie.org/10788301

Re: node 2 not rejoining cluster

Posted by s influxdb <el...@gmail.com>.

this is the pastie
http://pastie.org/10788301

On Wed, Apr 6, 2016 at 9:41 PM, s influxdb <el...@gmail.com> wrote:

> One of our nodes hit an OOM (java.lang.OutOfMemoryError: unable to
> create new native thread) and then became unresponsive.
>
> We tried to add the node back to the cluster, but with no luck.
>
> It doesn't seem to receive any notification messages from the other
> nodes; it just keeps sending notifications in a loop.
>
> Please see the attached logs from the node that is out of rotation.
>
> Any input is appreciated.
>
> Thanks
>