
Posted to user@zookeeper.apache.org by Powell Molleti <pm...@vmware.com> on 2015/08/31 23:34:38 UTC

Re: quorum connection manager shutdown takes long time

In reference to:
https://issues.apache.org/jira/browse/ZOOKEEPER-2246

Simply removing sock.setSoTimeout(0) from http://s.apache.org/TfI has the unintended consequence of shutting down both the RecvWorker and SendWorker threads in all cases. It seems the current code is designed to keep the socket alive (and the threads running) so that the channel can be reused to communicate with a peer node that is still alive but needs to redo leader election.

I could not reproduce any issue when the threads shut down after the timeout, since new threads are created for the next iteration of leader election. Still, I would rather reuse the threads and the channel, hence I propose the following approach.

The alternative I suggest is to still remove setSoTimeout(0) from here: http://s.apache.org/TfI, to also enable SO_KEEPALIVE via setKeepAlive() on this socket, and to not treat a timeout as an error when it occurs here: http://bit.ly/1JHIdVY but to treat it as an error when it occurs here: http://bit.ly/1NTjQ9R

This means that users can tune the TCP keep-alive timeouts so that socket failures propagate to user space sooner, and ZooKeeper also resets the socket if it detects that the other side is not responding when it knows it needs a response within some bounded time.
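
To make the proposal above more concrete, here is a rough, self-contained sketch of the idea, not the actual QuorumCnxManager code. The class name, the ReadResult enum, and the configure() and readOnce() helpers are all hypothetical names made up for illustration; only setKeepAlive(), setSoTimeout() and SocketTimeoutException come from the standard java.net API. It assumes the reader knows whether a reply is currently expected from the peer.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical sketch: the names below are illustrative and are not taken
// from ZooKeeper's QuorumCnxManager.
final class KeepAliveRecvSketch {

    /** Outcome of one blocking read attempt. */
    enum ReadResult { DATA, IDLE_TIMEOUT, RESPONSE_TIMEOUT }

    /** Enable SO_KEEPALIVE and keep a finite read timeout (no setSoTimeout(0)). */
    static void configure(Socket sock, int soTimeoutMs) throws IOException {
        sock.setKeepAlive(true);        // let the kernel probe an idle connection
        sock.setSoTimeout(soTimeoutMs); // blocking reads give up after this long
    }

    /**
     * Try to read one 4-byte value. A read timeout is reported as IDLE_TIMEOUT
     * when no reply is expected (caller keeps the channel open) and as
     * RESPONSE_TIMEOUT when a reply was expected (caller may reset the socket).
     * Assumption: a timeout in the middle of a message is not handled here.
     */
    static ReadResult readOnce(DataInputStream in, boolean awaitingResponse)
            throws IOException {
        try {
            in.readInt();
            return ReadResult.DATA;
        } catch (SocketTimeoutException e) {
            // The socket itself is still valid after a read timeout.
            return awaitingResponse ? ReadResult.RESPONSE_TIMEOUT
                                    : ReadResult.IDLE_TIMEOUT;
        }
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket server =
                     new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client =
                     new Socket(server.getInetAddress(), server.getLocalPort());
             Socket peer = server.accept()) {
            configure(client, 200);
            DataInputStream in = new DataInputStream(client.getInputStream());
            System.out.println(readOnce(in, false)); // peer silent, nothing expected
            System.out.println(readOnce(in, true));  // peer silent, reply expected
            new DataOutputStream(peer.getOutputStream()).writeInt(42);
            System.out.println(readOnce(in, false)); // peer has sent a message
        }
    }
}
```

In this sketch the caller decides what to do with each result: on IDLE_TIMEOUT it would simply loop and read again, so the threads and the channel are reused, while on RESPONSE_TIMEOUT it would close the socket. How quickly SO_KEEPALIVE itself reports a dead peer depends on the operating system's keep-alive settings, which are outside the scope of this example.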

Ideally, I wish there were userspace pings on every socket channel between ZooKeeper nodes to detect dead channels quickly. It seems one exists for the sockets used for Follow/Lead after leader election is done, but not for this one? Such a feature could be added, with care taken to keep it backward compatible.

I posted the above text to Jira. Please point out any wrong assumptions I have made, and share any comments and suggestions.

Thanks
Powell.

> From Raúl Gutiérrez Segalés <.....@itevenworks.net>
> Subject Re: quorum connection manager shutdown takes long time
> Date Thu, 10 Jul 2014 18:02:37 GMT
> On 9 July 2014 08:28, Michi Mutsuzaki <mi...@cs.stanford.edu> wrote:

>> I don't know how I missed that :) QA said this is reproducible, so
>> I'll try commenting this line out. Thanks Flavio!
>>

> I am curious, was it that?
> -rgs

Re: quorum connection manager shutdown takes long time

Posted by Powell Molleti <pm...@vmware.com>.

Apologies for not posting the link to the old thread. Here it is:
http://bit.ly/1JAaJaJ

Thanks
Powell.
