You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@kafka.apache.org by 东方甲乙 <25...@qq.com> on 2017/05/21 13:51:17 UTC

RE： [DISCUSS] KIP-148: Add a connect timeout for client

Hi Guozhang,

Thanks for the clarify. For the clarify 2, I think the key thing is not users control how much time in maximum to wait for inside code, but is the network client can be aware of the connecting can't be finished and try a good node. In the producer.sender even the selector.poll can timeout, but the next time is also not close the previous connecting and try another good node.

In out test env, QA shutdown one of the leader node, the producer send the request will timeout and close the node's connection then request the metadata.  But sometimes the request node is also the shutdown node.  When connecting the shutting down node to get the metadata, it is in the connecting phase, network client mark the connecting node's state to CONNECTING, but if the node is shutdown,  the socket can't be aware of the connecting is broken. Though the selector.poll has timeout parameter, but it will not close the connection, so the next 
time in the "networkclient.maybeUpdate" it will check if isAnyNodeConnecting, then will not connect to any good node the get the metadata.  It need about several minutes to 
aware the connecting is timeout and try other node.  

So I want to add a connect.timeout parameter,  the selector can find the connecting is timeout and close the connection.  It seems the currently the timeout value passed in `selector.poll()`
seems can not do this.

Thanks,
David

------------------ 原始邮件 ------------------
发件人: "Guozhang Wang";<wa...@gmail.com>;
发送时间: 2017年5月16日(星期二) 凌晨1:51
收件人: "dev@kafka.apache.org"<de...@kafka.apache.org>; 

主题: Re: [DISCUSS] KIP-148: Add a connect timeout for client

Hi David,

I may be a bit confused before, just clarifying a few things:

1. As you mentioned, a client will always try to first establish the
connection with a broker node before it tries to send any request to it.
And after connection is established, it will either continuously send many
requests (e.g. produce) for just a single request (e.g. metadata) to the
broker, so these two phases are indeed different.

2. In the connected phase, connections.max.idle.ms is used to
auto-disconnect the socket if no requests has been sent / received during
that period of time; in the connecting phase, we always try to create the
socket via "socketChannel.connect" in a non-blocking call, and then checks
if the connection has been established, but all the callers of this
function (in either producer or consumer) has a timeout parameter as in
`selector.poll()`, and the timeout parameter is set either by calculations
based on metadata.expiration.time and backoff for producer#sender, or by
directly passed values from consumer#poll(timeout), so although there is no
directly config controlling that, users can still control how much time in
maximum to wait for inside code.

I originally thought your scenarios is more on the connected phase, but now
I feel you are talking about the connecting phase. For that case, I still
feel currently the timeout value passed in `selector.poll()` which is
controllable from user code should be sufficient?

Guozhang

On Sun, May 14, 2017 at 2:37 AM, 东方甲乙 <25...@qq.com> wrote:

> Hi Guozhang,
>
>
> Sorry for the delay, thanks for the question.  It seems two different
> parameters to me:
> connect.timeout.ms: only work for the connecting phrase, after connected
> phrase this parameter is not used.
> connections.max.idle.ms: currently not work in the connecting phrase
> (only select return readyKeys >0) will add to the expired manager, after
> connected will check if the connection is still alive in some time.
>
>
> Even if we change the connections.max.idle.ms to work including the
> connecting phrase, we can not set this parameter to a small value, such as
> 5 seconds. Because the client is maybe busy sending message to other node,
> it will be disconnected in 5 seconds, so the default value of
> connections.max.idle.ms is setting to a larger time. We should have two
> parameters to control the connecting phrase behavior and the connected
> phrase behavior, do you think so?
>
>
> Thanks,
>
>
> David
>
>
>
>
> ------------------ 原始邮件 ------------------
> 发件人: "Guozhang Wang";<wa...@gmail.com>;
> 发送时间: 2017年5月6日(星期六) 上午7:52
> 收件人: "dev@kafka.apache.org"<de...@kafka.apache.org>;
>
> 主题: Re: [DISCUSS] KIP-148: Add a connect timeout for client
>
>
>
> Hello David,
>
> Thanks for the KIP. For the described issue, I'm wondering if it can be
> resolved by tuning the CONNECTIONS_MAX_IDLE_MS_CONFIG (
> connections.max.idle.ms) on the client side? Default is 9 minutes.
>
>
> Guozhang
>
> On Tue, May 2, 2017 at 8:22 AM, 东方甲乙 <25...@qq.com> wrote:
>
> > Hi all,
> >
> > Currently in our test environment, we found that after one of the broker
> > node crash (reboot or os crash), the client may still be connecting to
> the
> > crash node to send metadata request or other request, and it needs
> several
> > minutes to be aware that the connection is timeout then try another node
> to
> > connect to send the request. Then the client may still not be aware of
> the
> > metadata change after several minutes.
> >
> >
> > So I want to add a connect timeout on the  client,  please take a look
> at：
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > 148%3A+Add+a+connect+timeout+for+client
> >
> > Regards,
> >
> > David
>
>
>
>
> --
> -- Guozhang
>

-- 
-- Guozhang

Re: [DISCUSS] KIP-148: Add a connect timeout for client

Posted by Rajini Sivaram <ra...@gmail.com>.

Hi David,

Is there a reason why you wouldn't want to use request.timeout.ms as the
timeout parameter for connections? Then you would use the same timeout for
connected and connecting phases when shutdown is unclean. You could still
use the timeout to ensure that next metadata request is sent to another
node.

Regards,

Rajini

On Sun, May 21, 2017 at 9:51 AM, 东方甲乙 <25...@qq.com> wrote:

> Hi Guozhang,
>
>
> Thanks for the clarify. For the clarify 2, I think the key thing is not
> users control how much time in maximum to wait for inside code, but is the
> network client can be aware of the connecting can't be finished and try a
> good node. In the producer.sender even the selector.poll can timeout, but
> the next time is also not close the previous connecting and try another
> good node.
>
>
> In out test env, QA shutdown one of the leader node, the producer send the
> request will timeout and close the node's connection then request the
> metadata.  But sometimes the request node is also the shutdown node.  When
> connecting the shutting down node to get the metadata, it is in the
> connecting phase, network client mark the connecting node's state to
> CONNECTING, but if the node is shutdown,  the socket can't be aware of the
> connecting is broken. Though the selector.poll has timeout parameter, but
> it will not close the connection, so the next
> time in the "networkclient.maybeUpdate" it will check if
> isAnyNodeConnecting, then will not connect to any good node the get the
> metadata.  It need about several minutes to
> aware the connecting is timeout and try other node.
>
>
> So I want to add a connect.timeout parameter,  the selector can find the
> connecting is timeout and close the connection.  It seems the currently the
> timeout value passed in `selector.poll()`
> seems can not do this.
>
>
> Thanks,
> David
>
>
>
>
>
>
> ------------------ 原始邮件 ------------------
> 发件人: "Guozhang Wang";<wa...@gmail.com>;
> 发送时间: 2017年5月16日(星期二) 凌晨1:51
> 收件人: "dev@kafka.apache.org"<de...@kafka.apache.org>;
>
> 主题: Re: [DISCUSS] KIP-148: Add a connect timeout for client
>
>
>
> Hi David,
>
> I may be a bit confused before, just clarifying a few things:
>
> 1. As you mentioned, a client will always try to first establish the
> connection with a broker node before it tries to send any request to it.
> And after connection is established, it will either continuously send many
> requests (e.g. produce) for just a single request (e.g. metadata) to the
> broker, so these two phases are indeed different.
>
> 2. In the connected phase, connections.max.idle.ms is used to
> auto-disconnect the socket if no requests has been sent / received during
> that period of time; in the connecting phase, we always try to create the
> socket via "socketChannel.connect" in a non-blocking call, and then checks
> if the connection has been established, but all the callers of this
> function (in either producer or consumer) has a timeout parameter as in
> `selector.poll()`, and the timeout parameter is set either by calculations
> based on metadata.expiration.time and backoff for producer#sender, or by
> directly passed values from consumer#poll(timeout), so although there is no
> directly config controlling that, users can still control how much time in
> maximum to wait for inside code.
>
> I originally thought your scenarios is more on the connected phase, but now
> I feel you are talking about the connecting phase. For that case, I still
> feel currently the timeout value passed in `selector.poll()` which is
> controllable from user code should be sufficient?
>
>
> Guozhang
>
>
>
>
> On Sun, May 14, 2017 at 2:37 AM, 东方甲乙 <25...@qq.com> wrote:
>
> > Hi Guozhang,
> >
> >
> > Sorry for the delay, thanks for the question.  It seems two different
> > parameters to me:
> > connect.timeout.ms: only work for the connecting phrase, after connected
> > phrase this parameter is not used.
> > connections.max.idle.ms: currently not work in the connecting phrase
> > (only select return readyKeys >0) will add to the expired manager, after
> > connected will check if the connection is still alive in some time.
> >
> >
> > Even if we change the connections.max.idle.ms to work including the
> > connecting phrase, we can not set this parameter to a small value, such
> as
> > 5 seconds. Because the client is maybe busy sending message to other
> node,
> > it will be disconnected in 5 seconds, so the default value of
> > connections.max.idle.ms is setting to a larger time. We should have two
> > parameters to control the connecting phrase behavior and the connected
> > phrase behavior, do you think so?
> >
> >
> > Thanks,
> >
> >
> > David
> >
> >
> >
> >
> > ------------------ 原始邮件 ------------------
> > 发件人: "Guozhang Wang";<wa...@gmail.com>;
> > 发送时间: 2017年5月6日(星期六) 上午7:52
> > 收件人: "dev@kafka.apache.org"<de...@kafka.apache.org>;
> >
> > 主题: Re: [DISCUSS] KIP-148: Add a connect timeout for client
> >
> >
> >
> > Hello David,
> >
> > Thanks for the KIP. For the described issue, I'm wondering if it can be
> > resolved by tuning the CONNECTIONS_MAX_IDLE_MS_CONFIG (
> > connections.max.idle.ms) on the client side? Default is 9 minutes.
> >
> >
> > Guozhang
> >
> > On Tue, May 2, 2017 at 8:22 AM, 东方甲乙 <25...@qq.com> wrote:
> >
> > > Hi all,
> > >
> > > Currently in our test environment, we found that after one of the
> broker
> > > node crash (reboot or os crash), the client may still be connecting to
> > the
> > > crash node to send metadata request or other request, and it needs
> > several
> > > minutes to be aware that the connection is timeout then try another
> node
> > to
> > > connect to send the request. Then the client may still not be aware of
> > the
> > > metadata change after several minutes.
> > >
> > >
> > > So I want to add a connect timeout on the  client,  please take a look
> > at：
> > >
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > 148%3A+Add+a+connect+timeout+for+client
> > >
> > > Regards,
> > >
> > > David
> >
> >
> >
> >
> > --
> > -- Guozhang
> >
>
>
>
> --
> -- Guozhang
>