You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@zookeeper.apache.org by "Vishal K (JIRA)" <ji...@apache.org> on 2010/11/02 19:10:27 UTC

[jira] Commented: (ZOOKEEPER-900) FLE implementation should be improved to use non-blocking sockets

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927518#action_12927518 ] 

Vishal K commented on ZOOKEEPER-900:
------------------------------------

Hi Flavio,

I have a suggestion for changing the blocking IO code in QuorumCnxManager. It keeps the current code structure and requires a small amount of changes. I am not sure if these comments should go in ZOOKEEPER-901. ZOOKEEPER-901 is probably addressing netty as well. Please feel free to close this JIRA if you intend to make all the changes as a part of ZOOKEEPER-901.

Basically we jusy need to move parts of initiateConnection and receiveConnection to SenderWorker and ReceiveWorker.

A. Current flow for receiving connection:
1. accept connection in Listener.run()
2. receiveConnection()
	- Read remote server's ID
	- Take action based on my ID and remote server's ID (disconnect and reconnect if my ID is > remote server's ID).
	- kill current set of SenderWorker and ReciveWorker threads
	- Start a new pair

B Current flow for initiating connection:
1. In connectOne(), connect if not already connected. else return.
2. send my ID to the remote server
3. if my ID < remote server disconnect and return
4. if my ID > remote server
	- kill current set of SenderWorker and ReceiveWorkter threads for the remote server
	- Start a new pair


Proposed changes:
Move the code that performs any blocking IO in SenderWorker and ReceiveWorker.

A. Proposed flow for receiving connection:
1. accept connection in Listener.run()
2. receiveConnection()
	- kill current set of SenderWorker and ReciveWorker threads
	- Start a new pair

Proposed changed to SenderWorker:
	- Read remote server's ID
	- Take action based on my ID and remote server's ID (disconnect and reconnect if my ID is > remote server's ID).
	- Proceed to normal operation


B Proposed flow for initiating connection:
1. in connectOne(), return if already connected
2. Start a new SenderWorker and ReceiveWorker pair
2. In SenderWorker
	- connect to remote server
	- write my ID
	- if my ID < remote server disconnect and return (shutdown the pair).
	- Proceed to normal operation


Questions:
- In QuorumCnxManager, is it necessary to kill the current pair and restart a new one every time we receive a connect request?
- In receiveConnection we may choose to reject an accepted connection if a thread in
SenderWorker is in the process of connecting. Otherwise a server with ID <
remote server may keep sending frequent connect request that will result in the
remote server closing connections for this peer. But I think we add a delay
before sending notifications, which might be good enough to prevent this
problem.

Let me know what you think about this. I can also help with the implementation.

-Vishal


> FLE implementation should be improved to use non-blocking sockets
> -----------------------------------------------------------------
>
>                 Key: ZOOKEEPER-900
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-900
>             Project: Zookeeper
>          Issue Type: Bug
>            Reporter: Vishal K
>            Assignee: Flavio Junqueira
>            Priority: Critical
>
> From earlier email exchanges:
> 1. Blocking connects and accepts:
> a) The first problem is in manager.toSend(). This invokes connectOne(), which does a blocking connect. While testing, I changed the code so that connectOne() starts a new thread called AsyncConnct(). AsyncConnect.run() does a socketChannel.connect(). After starting AsyncConnect, connectOne starts a timer. connectOne continues with normal operations if the connection is established before the timer expires, otherwise, when the timer expires it interrupts AsyncConnect() thread and returns. In this way, I can have an upper bound on the amount of time we need to wait for connect to succeed. Of course, this was a quick fix for my testing. Ideally, we should use Selector to do non-blocking connects/accepts. I am planning to do that later once we at least have a quick fix for the problem and consensus from others for the real fix (this problem is big blocker for us). Note that it is OK to do blocking IO in SenderWorker and RecvWorker threads since they block IO to the respective !
 peer.
> b) The blocking IO problem is not just restricted to connectOne(), but also in receiveConnection(). The Listener thread calls receiveConnection() for each incoming connection request. receiveConnection does blocking IO to get peer's info (s.read(msgBuffer)). Worse, it invokes connectOne() back to the peer that had sent the connection request. All of this is happening from the Listener. In short, if a peer fails after initiating a connection, the Listener thread won't be able to accept connections from other peers, because it would be stuck in read() or connetOne(). Also the code has an inherent cycle. initiateConnection() and receiveConnection() will have to be very carefully synchronized otherwise, we could run into deadlocks. This code is going to be difficult to maintain/modify.
> Also see: https://issues.apache.org/jira/browse/ZOOKEEPER-822

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.