Posted to dev@zookeeper.apache.org by "Vishal K (JIRA)" <ji...@apache.org> on 2010/11/10 21:08:13 UTC

[jira] Created: (ZOOKEEPER-928) Follower should stop following and start FLE if it does not receive pings from the leader

Follower should stop following and start FLE if it does not receive pings from the leader
-----------------------------------------------------------------------------------------

                 Key: ZOOKEEPER-928
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-928
             Project: Zookeeper
          Issue Type: Bug
    Affects Versions: 3.3.2
            Reporter: Vishal K
             Fix For: 3.4.0


In Follower.followLeader() after syncing with the leader, the follower does:
                while (self.isRunning()) {
                    readPacket(qp);
                    processPacket(qp);
                }

It looks like it relies on socket timeout expiry to figure out if the connection with the leader has gone down. So a follower *with no clients* may never notice a faulty leader if the leader has a software hang but the TCP connections with the peers are still valid. Since it has no clients, it won't heartbeat with the leader. If a majority of followers are not connected to any clients, then FLE will fail even if other followers attempt to elect a new leader.

We should keep track of pings received from the leader and, if we haven't seen
a ping packet from the leader for (syncLimit * tickTime), give up following the
leader.
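
The bookkeeping proposed above could be sketched roughly as follows. This is only an illustration: LeaderPingTracker and its methods are hypothetical names, not existing ZooKeeper classes, and the sketch assumes that tickTime is given in milliseconds and that the follower would check the tracker between reads or from a separate thread.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper (not part of ZooKeeper): remembers when the last leader ping arrived. */
public class LeaderPingTracker {
    private final long timeoutMs;        // syncLimit * tickTime, in milliseconds
    private final AtomicLong lastPingMs; // timestamp of the most recent ping

    public LeaderPingTracker(int tickTimeMs, int syncLimit, long nowMs) {
        this.timeoutMs = (long) tickTimeMs * syncLimit;
        this.lastPingMs = new AtomicLong(nowMs);
    }

    /** Call whenever a ping packet from the leader has been processed. */
    public void recordPing(long nowMs) {
        lastPingMs.set(nowMs);
    }

    /** True when no ping has been seen for more than syncLimit * tickTime. */
    public boolean leaderTimedOut(long nowMs) {
        return nowMs - lastPingMs.get() > timeoutMs;
    }

    public static void main(String[] args) {
        // Assumed example values: tickTime = 2000 ms, syncLimit = 5.
        LeaderPingTracker tracker = new LeaderPingTracker(2000, 5, 0L);
        tracker.recordPing(1_000L);
        System.out.println(tracker.leaderTimedOut(10_000L)); // still inside the window
        System.out.println(tracker.leaderTimedOut(11_001L)); // window exceeded
    }
}
```

Passing the current time in explicitly keeps the sketch easy to test; a real follower would more likely read a clock inside the tracker.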

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (ZOOKEEPER-928) Follower should stop following and start FLE if it does not receive pings from the leader

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar resolved ZOOKEEPER-928.
-------------------------------------

       Resolution: Won't Fix
    Fix Version/s:     (was: 3.3.3)
                       (was: 3.4.0)

No worries, Vishal. Resolving the issue as Won't Fix.



[jira] Commented: (ZOOKEEPER-928) Follower should stop following and start FLE if it does not receive pings from the leader

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930800#action_12930800 ] 

Flavio Junqueira commented on ZOOKEEPER-928:
--------------------------------------------

My understanding is that SO_TIMEOUT also affects SocketChannel, since it builds on top of a Socket object.



[jira] Commented: (ZOOKEEPER-928) Follower should stop following and start FLE if it does not receive pings from the leader

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930905#action_12930905 ] 

Patrick Hunt commented on ZOOKEEPER-928:
----------------------------------------

According to this, it's not a bug:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4614802

specifically:

The read methods in SocketChannel (and DatagramChannel) do not
support timeouts.  If you need the timeout functionality then use the read
methods of the associated Socket (or DatagramSocket) object.

Note that this was asked and answered a while ago, but I suspect it's still true.



[jira] Commented: (ZOOKEEPER-928) Follower should stop following and start FLE if it does not receive pings from the leader

Posted by "Vishal K (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931085#action_12931085 ] 

Vishal K commented on ZOOKEEPER-928:
------------------------------------

Hi Flavio,

That's correct. I was planning to make this change (along with other changes) as part of ZOOKEEPER-900.
But now I think it is better to make this change first rather than wait for the other changes, so that we don't have to wait until 3.4.0 for this fix.
At least that will get us around the block-forever problem.

-Vishal






[jira] Commented: (ZOOKEEPER-928) Follower should stop following and start FLE if it does not receive pings from the leader

Posted by "Vishal K (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930780#action_12930780 ] 

Vishal K commented on ZOOKEEPER-928:
------------------------------------

Hi Flavio,

I was aware of that. However, this is not a case of an indefinite TCP I/O hang. If the leader hangs (e.g., a software deadlock in ZooKeeper), its TCP connection will remain active. The follower will not see a socket timeout. Now, how can the follower determine if the leader is down?



[jira] Commented: (ZOOKEEPER-928) Follower should stop following and start FLE if it does not receive pings from the leader

Posted by "Vishal K (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930789#action_12930789 ] 

Vishal K commented on ZOOKEEPER-928:
------------------------------------

Sorry for the false alarm. I got confused because SocketChannel is used in QuorumCnxManager, but this part of the code uses Socket and InputArchive.



[jira] Commented: (ZOOKEEPER-928) Follower should stop following and start FLE if it does not receive pings from the leader

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930851#action_12930851 ] 

Flavio Junqueira commented on ZOOKEEPER-928:
--------------------------------------------

The documentation refers to SocketInputStream.read(), but it doesn't mention SocketChannel.read(). I ran a quick test with QuorumCnxManager, and the timeout doesn't seem to take effect. So maybe it is true that setting SO_TIMEOUT has no effect on SocketChannel.read(), which is kind of surprising to me.



[jira] Commented: (ZOOKEEPER-928) Follower should stop following and start FLE if it does not receive pings from the leader

Posted by "Vishal K (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930806#action_12930806 ] 

Vishal K commented on ZOOKEEPER-928:
------------------------------------

Hi Flavio,

Can you please try it with SocketChannel and confirm?



[jira] Updated: (ZOOKEEPER-928) Follower should stop following and start FLE if it does not receive pings from the leader

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Hunt updated ZOOKEEPER-928:
-----------------------------------

      Component/s: server
                   quorum
         Priority: Critical  (was: Major)
    Fix Version/s: 3.3.3



[jira] Commented: (ZOOKEEPER-928) Follower should stop following and start FLE if it does not receive pings from the leader

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930788#action_12930788 ] 

Flavio Junqueira commented on ZOOKEEPER-928:
--------------------------------------------

Hi Vishal, my understanding is that the readRecord call in readPacket will time out even if the TCP connection is still up. The documentation at http://download.oracle.com/javase/6/docs/api/java/net/SocketOptions.html

says that:
{noformat}
static int 	SO_TIMEOUT
          Set a timeout on blocking Socket operations:
{noformat}



[jira] Commented: (ZOOKEEPER-928) Follower should stop following and start FLE if it does not receive pings from the leader

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930774#action_12930774 ] 

Flavio Junqueira commented on ZOOKEEPER-928:
--------------------------------------------

I've just seen the messages on zookeeper-dev, and I'm not sure this is right:

# readPacket is implemented in Learner.java, and the socket read is performed in this line: leaderIs.readRecord(pp, "packet");
# leaderIs is an InputArchive instance instantiated in Learner:connectToLeader;
# The socket used to instantiate leaderIs has its SO_TIMEOUT value set right before in connectToLeader: sock.setSoTimeout(self.tickTime * self.initLimit).

Consequently, the operation should not be delayed indefinitely and should return after self.tickTime * self.initLimit. This discussion on SO_TIMEOUT sounds familiar, huh? ;-)
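
The behaviour described in the three points above can be reproduced in isolation. The following is a minimal sketch, not ZooKeeper code: SoTimeoutDemo is a hypothetical class, a loopback ServerSocket stands in for a leader that keeps its TCP connection open but never sends anything, and DataInputStream.readInt() stands in for the InputArchive read.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

/** Hypothetical demo: a blocking read on a live but silent connection honours SO_TIMEOUT. */
public class SoTimeoutDemo {

    /** Returns true if the blocking read gave up with a SocketTimeoutException. */
    public static boolean readTimesOut(int timeoutMs) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket silentLeader = new ServerSocket(0, 1, loopback);
             Socket sock = new Socket(loopback, silentLeader.getLocalPort());
             Socket leaderSide = silentLeader.accept()) {
            // Set the timeout before entering the blocking read, as connectToLeader does.
            sock.setSoTimeout(timeoutMs);
            DataInputStream in = new DataInputStream(sock.getInputStream());
            try {
                in.readInt(); // blocks: the peer is connected but never writes
                return false;
            } catch (SocketTimeoutException e) {
                return true;  // the read returned even though the connection is still up
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 200 ms is an arbitrary stand-in for self.tickTime * self.initLimit.
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```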



[jira] Commented: (ZOOKEEPER-928) Follower should stop following and start FLE if it does not receive pings from the leader

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930779#action_12930779 ] 

Mahadev konar commented on ZOOKEEPER-928:
-----------------------------------------

Good point, Flavio! I totally forgot about that. That should prevent this failure case. Vishal, your thoughts?




[jira] Commented: (ZOOKEEPER-928) Follower should stop following and start FLE if it does not receive pings from the leader

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930953#action_12930953 ] 

Flavio Junqueira commented on ZOOKEEPER-928:
--------------------------------------------

Good point, Pat. I should have remembered this, since our hack to introduce the 
connection timeout in QCM previously was through the socket directly, so it makes
sense that we would have to do the same for other blocking operations. In fact, I 
have quickly tried replacing the read call in receiveConnection with the following:

{noformat}
s.socket().getInputStream().read(msgBytes);
{noformat}

and I get a SocketTimeoutException after the specified timeout.
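
A standalone version of that experiment might look like the sketch below. It is an assumption-laden illustration rather than the QuorumCnxManager code: ChannelReadTimeoutDemo is a hypothetical class, and a loopback ServerSocketChannel plays the peer that connects but never sends. The channel is left in blocking mode, which the socket adaptor's stream requires.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

/** Hypothetical demo: reading through the channel's Socket makes SO_TIMEOUT apply. */
public class ChannelReadTimeoutDemo {

    /** Returns true if the read via s.socket().getInputStream() timed out. */
    public static boolean adaptorReadTimesOut(int timeoutMs) {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            try (SocketChannel s = SocketChannel.open(server.getLocalAddress());
                 SocketChannel silentPeer = server.accept()) {
                s.socket().setSoTimeout(timeoutMs);
                byte[] msgBytes = new byte[8];
                try {
                    // Same call shape as in the comment above; the peer never writes.
                    s.socket().getInputStream().read(msgBytes);
                    return false;
                } catch (SocketTimeoutException e) {
                    return true;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("adaptor read timed out: " + adaptorReadTimesOut(200));
    }
}
```

The sketch deliberately does not call SocketChannel.read() directly, since per the bug report quoted earlier in the thread that call would not be bounded by the timeout.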



[jira] Commented: (ZOOKEEPER-928) Follower should stop following and start FLE if it does not receive pings from the leader

Posted by "Vishal K (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930786#action_12930786 ] 

Vishal K commented on ZOOKEEPER-928:
------------------------------------

OK, I see your point. I misanalyzed this part of the code. I will wait for Flavio to comment and then close the JIRA.



[jira] Commented: (ZOOKEEPER-928) Follower should stop following and start FLE if it does not receive pings from the leader

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930785#action_12930785 ] 

Mahadev konar commented on ZOOKEEPER-928:
-----------------------------------------

Vishal,
 Here is the definition of setSoTimeout:

{code}
public void setSoTimeout(int timeout)
                  throws SocketException
Enable/disable SO_TIMEOUT with the specified timeout, in milliseconds. With this option set to a non-zero timeout, a read() call on the InputStream associated with this Socket will block for only this amount of time. If the timeout expires, a java.net.SocketTimeoutException is raised, though the Socket is still valid. The option must be enabled prior to entering the blocking operation to have effect. The timeout must be > 0. A timeout of zero is interpreted as an infinite timeout.
{code}

This means that the read would block until the timeout and throw an exception if it doesn't hear from the leader during that time. Wouldn't this suffice?
