Posted to users@trafficserver.apache.org by Nick Muerdter <st...@nickm.org> on 2015/10/04 18:16:33 UTC

TrafficServer 6, keep-alive, connection retries, and 502 Server Hangups

Hi,

I've observed some differences in how TrafficServer 6.0.0 behaves with
connection retrying and outgoing keep-alive connections. I believe the
changes in behavior might be related to this issue:
https://issues.apache.org/jira/browse/TS-3440 However, I wasn't sure if
the new behavior (specifically around keep-alive handling) was
intentional or not, so I thought I'd ping the mailing list.

What I'm seeing in 6.0.0 is that if TrafficServer has some backend
keep-alive connections already opened, but then one of the keep-alive
connections is closed, the next request to TrafficServer may generate a
502 Server Hangup response when attempting to reuse that connection.
Previously, I think TrafficServer was retrying when it encountered a
closed keep-alive connection, but that is no longer the case. So if you
have a backend that might unexpectedly close its open keep-alive
connections, the only way I've found to completely prevent these 502
errors in 6.0.0 is to disable outgoing keep-alive (the
proxy.config.http.keep_alive_enabled_out and
proxy.config.http.keep_alive_post_out settings).
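
For reference, a minimal sketch of that workaround in records.config
(assuming the stock records.config syntax; adjust for your install):

```
CONFIG proxy.config.http.keep_alive_enabled_out INT 0
CONFIG proxy.config.http.keep_alive_post_out INT 0
```

This avoids reusing backend connections at all, at the cost of opening a
new backend connection for each request, so it is a workaround rather
than a fix.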

For a slightly more concrete example of what can trigger this, this is
fairly easy to reproduce with the following setup:

- TrafficServer is proxying to nginx with outgoing keep-alive
connections enabled (the default).
- Throw a constant stream of requests at TrafficServer.
- While that constant stream of requests is happening, also
periodically send SIGHUP signals to nginx to reload it.
- Eventually you'll get some 502 Server Hangup responses from
TrafficServer among your stream of requests.
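
The steps above can be sketched roughly as the following Python script.
This is only an illustration, not the exact tooling I used: the URL, the
nginx pid file path, and the request counts are all assumptions, and the
requests are sent sequentially rather than as a truly concurrent stream.

```python
import collections
import os
import signal
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status for one GET, or 0 if no HTTP response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses land here, including the 502s being hunted.
        return err.code
    except (urllib.error.URLError, OSError):
        return 0


def reload_nginx(pid_file="/var/run/nginx.pid"):
    """Send SIGHUP to the nginx master process (pid file path is an assumption)."""
    with open(pid_file) as fh:
        os.kill(int(fh.read().strip()), signal.SIGHUP)


def run_load(url, total, reload_every, fetch=fetch_status, reload=reload_nginx):
    """Send `total` requests, reloading the backend every `reload_every` requests."""
    counts = collections.Counter()
    for i in range(1, total + 1):
        counts[fetch(url)] += 1
        if i % reload_every == 0:
            reload()
    return counts


if __name__ == "__main__":
    # Hypothetical TrafficServer listen address; change to match your setup.
    result = run_load("http://127.0.0.1:8080/", total=10000, reload_every=100)
    print(dict(result))
```

Any non-zero count under the 502 key in the printed summary would
correspond to the Server Hangup responses described above.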

SIGHUPs in nginx should result in zero downtime for new requests, but I
think what's happening is that TrafficServer may fail when it reuses an
old keep-alive connection (it's not common, since it depends on timing
and on whether the connection belongs to an old nginx worker that has
since been shut down). In TrafficServer 5.3.1 these connection failures
were retried, but in 6.0.0 no retries occur in this case.

Here are some debug logs that show the difference in behavior between
6.0.0 and 5.3.1. Note that the differences seem to stem from how each
version eventually handles the "VC_EVENT_EOS" event following
"&HttpSM::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE".

5.3.1:
https://gist.github.com/GUI/0c53a6c4fdc2782b14aa#file-trafficserver_5-3-1-log-L316
6.0.0:
https://gist.github.com/GUI/0c53a6c4fdc2782b14aa#file-trafficserver_6-0-0-log-L314

Interestingly, if I'm understanding the log files correctly, it looks like
TrafficServer is reporting an odd empty response from these connections
("HTTP/0.9 0" in 5.3.1 and "HTTP/1.0 0" in 6.0.0). However, as far as I
can tell from TCP dumps on the system, nginx is not actually sending any
form of response.

So my basic question is whether the new behavior in 6.0.0 is correct or
not. Based on the discussion in
https://issues.apache.org/jira/browse/TS-3440 I'm unsure whether 5.3.1
retrying on these closed keep-alive connections was actually safe or
not. In these example cases the backend server isn't sending back any
data (at least as far as I can tell), so from what I understand, it
should be safe to retry. However, I'm not totally sure that this
situation with dead keep-alive connections can properly be distinguished
from other types of hangups or connection errors, so perhaps it isn't
safe.
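
To make the reasoning concrete, here is a small sketch of the kind of
check I have in mind. It is not TrafficServer's actual logic; the
function name and parameters are hypothetical. The idempotency condition
is an extra conservative assumption borrowed from RFC 7230 section 6.3.1,
which says a proxy must not automatically retry non-idempotent requests.

```python
# Methods that HTTP defines as idempotent (RFC 7231 section 4.2.2).
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "OPTIONS", "TRACE", "PUT", "DELETE"})


def should_retry(method, reused_connection, response_bytes_received, attempts_left):
    """Decide whether a failed backend request looks safe to retry.

    Hypothetical helper: it only covers the stale keep-alive case, where
    a reused connection was closed before any response data came back.
    """
    if attempts_left <= 0:
        return False
    if not reused_connection:
        # Failures on brand-new connections are a different problem.
        return False
    if response_bytes_received > 0:
        # The backend started answering, so it may have acted on the request.
        return False
    return method.upper() in IDEMPOTENT_METHODS
```

Under these assumptions a GET on a dead keep-alive connection would be
retried, while a POST, or any request that received partial response
data, would not.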

If the 6.0.0 behavior is correct, is disabling outgoing keep-alive
connections the best option if I'm worried about backend services
unexpectedly killing off old keep-alive connections? Or is this a bug
with 6.0.0, and should TrafficServer retries technically be possible in
these cases?

Thanks!
Nick

Re: TrafficServer 6, keep-alive, connection retries, and 502 Server Hangups

Posted by Esmq <es...@163.com>.

I have encountered the same problem, and I am trying to disable
keep-alive to the origin server (OS) to mitigate it.


At 2015-10-07 07:19:26, "Nick Muerdter" <st...@nickm.org> wrote:
>On Tue, Oct 6, 2015, at 04:33 PM, James Peach wrote:
>> Hi Nick,
>> 
>> This sounds like a 6.0 regression to me. Can you file the above
>> information in Jira?
>> 
>> thanks,
>> James
>
>Thanks for the sanity check! I've filed an issue:
>https://issues.apache.org/jira/browse/TS-3959

Re: TrafficServer 6, keep-alive, connection retries, and 502 Server Hangups

Posted by Nick Muerdter <st...@nickm.org>.

On Tue, Oct 6, 2015, at 04:33 PM, James Peach wrote:
> Hi Nick,
> 
> This sounds like a 6.0 regression to me. Can you file the above
> information in Jira?
> 
> thanks,
> James

Thanks for the sanity check! I've filed an issue:
https://issues.apache.org/jira/browse/TS-3959

Re: TrafficServer 6, keep-alive, connection retries, and 502 Server Hangups

Posted by James Peach <jp...@apache.org>.

Hi Nick,

This sounds like a 6.0 regression to me. Can you file the above information in Jira?

thanks,
James