Posted to issues@trafficserver.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/10/03 15:18:20 UTC

[jira] [Work logged] (TS-4509) Dropped keep-alive connections not being re-established (TS-3959 continued)

     [ https://issues.apache.org/jira/browse/TS-4509?focusedWorklogId=30070&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-30070 ]

ASF GitHub Bot logged work on TS-4509:
--------------------------------------

                Author: ASF GitHub Bot
            Created on: 03/Oct/16 15:17
            Start Date: 03/Oct/16 15:17
    Worklog Time Spent: 10m 
      Work Description: GitHub user jacksontj opened a pull request:

    https://github.com/apache/trafficserver/pull/1070

    TS-4509 Add `outstanding_bytes` to VConnection

    With this we can better check request retryability. Combined with not releasing the sessions immediately on error, this means that for a retryable request we can simply check whether the number of bytes still queued equals the number of bytes we asked to write. If they match, we can be sure that none of the data we sent was ACKed, so we are completely safe to retry.
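
For illustration only, the check described above could be sketched roughly as follows. The `WriteState` struct and the `safe_to_retry` function are hypothetical names invented for this sketch; they are not the VConnection API added by the pull request, and the field semantics are an assumption based on the description.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot of a write to the origin server (assumed fields,
// not the real VConnection interface).
struct WriteState {
  int64_t bytes_requested;   // total bytes we asked to write
  int64_t outstanding_bytes; // bytes still queued, i.e. not yet ACKed by the peer
};

// Returns true only when the request is retryable in principle AND every
// byte we asked to write is still outstanding, meaning the origin has not
// ACKed any part of the request.
bool safe_to_retry(const WriteState &w, bool request_is_retryable)
{
  // Sanity check for this sketch: the queue cannot hold more than was requested.
  assert(w.outstanding_bytes >= 0 && w.outstanding_bytes <= w.bytes_requested);

  if (!request_is_retryable) {
    return false;
  }
  return w.outstanding_bytes == w.bytes_requested;
}
```

Under these assumptions, a write of 100 bytes with 100 bytes still outstanding would be retried, while one with only 40 bytes outstanding would not, because part of the request may already have reached the origin.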

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jacksontj/trafficserver TS-3959

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/trafficserver/pull/1070.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1070
    
----
commit 760565954c2b7a9bb747d8e399a6b412b27466f0
Author: Thomas Jackson <ja...@gmail.com>
Date:   2016-10-03T15:16:28Z

    TS-4509 Add `outstanding_bytes` to VConnection
    
    With this we can better check request retryability. Combined with not releasing the sessions immediately on error, this means that for a retryable request we can simply check whether the number of bytes still queued equals the number of bytes we asked to write. If they match, we can be sure that none of the data we sent was ACKed, so we are completely safe to retry.

----


Issue Time Tracking
-------------------

            Worklog Id:     (was: 30070)
            Time Spent: 10m
    Remaining Estimate: 0h

> Dropped keep-alive connections not being re-established (TS-3959 continued)
> ---------------------------------------------------------------------------
>
>                 Key: TS-4509
>                 URL: https://issues.apache.org/jira/browse/TS-4509
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core, Network
>            Reporter: Thomas Jackson
>            Assignee: Thomas Jackson
>            Priority: Blocker
>             Fix For: 7.1.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I've observed some differences in how TrafficServer 6.0.0 behaves with connection retrying and outgoing keep-alive connections. I believe the changes in behavior might be related to this issue: https://issues.apache.org/jira/browse/TS-3440
> I originally wasn't sure if this was a bug, but James Peach indicated it sounded more like a regression on the mailing list (http://mail-archives.apache.org/mod_mbox/trafficserver-users/201510.mbox/%3cBA85D5A2-8B29-44A9-ACDC-E7FA8D21FC69@apache.org%3e).
> What I'm seeing in 6.0.0 is that if TrafficServer has some backend keep-alive connections already opened, but then one of the keep-alive connections is closed, the next request to TrafficServer may generate a 502 Server Hangup response when attempting to reuse that connection. Previously, I think TrafficServer was retrying when it encountered a closed keep-alive connection, but that is no longer the case. So if you have a backend that might unexpectedly close its open keep-alive connections, the only way I've found to completely prevent these 502 errors in 6.0.0 is to disable outgoing keepalive (proxy.config.http.keep_alive_enabled_out and proxy.config.http.keep_alive_post_out settings).
> For a slightly more concrete example of what can trigger this, this is fairly easy to reproduce with the following setup:
> - TrafficServer is proxying to nginx with outgoing keep-alive connections enabled (the default).
> - Throw a constant stream of requests at TrafficServer.
> - While that constant stream of requests is happening, also send a regular stream of SIGHUP commands to nginx to reload nginx.
> - Eventually you'll get some 502 Server Hangup responses from TrafficServer among your stream of requests.
> SIGHUPs in nginx should result in zero downtime for new requests, but I think what's happening is that TrafficServer may fail when an old keep-alived connection is reused (it's not common, so it depends on the timing of things and if the connection is from an old nginx worker that has since been shut down). In TrafficServer 5.3.1 these connection failures were retried, but in 6.0.0, no retries occur in this case.
> Here are some debug logs that show the difference in behavior between 6.0.0 and 5.3.1. Note that the differences seem to stem from how each version eventually handles the "VC_EVENT_EOS" event following "&HttpSM::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE".
> 5.3.1: https://gist.github.com/GUI/0c53a6c4fdc2782b14aa#file-trafficserver_5-3-1-log-L316
> 6.0.0: https://gist.github.com/GUI/0c53a6c4fdc2782b14aa#file-trafficserver_6-0-0-log-L314
> Interestingly, if I'm understanding the log files correctly, it looks like TrafficServer is reporting an odd empty response from these connections ("HTTP/0.9 0" in 5.3.1 and "HTTP/1.0 0" in 6.0.0). However, as far as I can tell from TCP dumps on the system, nginx is not actually sending any form of response.
> In these example cases the backend server isn't sending back any data (at least as far as I can tell), so from what I understand (and the logic outlined in https://issues.apache.org/jira/browse/TS-3440), it should be safe to retry.
> Let me know if I can provide any other details. Or if exact scripts to reproduce the issues against the example nginx backend I described above would be useful, I could get that together.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)