Posted to issues@trafficserver.apache.org by "Susan Hinrichs (JIRA)" <ji...@apache.org> on 2016/04/22 17:29:13 UTC

[jira] [Commented] (TS-4372) Traffic server heart beat fails with 6.1

    [ https://issues.apache.org/jira/browse/TS-4372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254096#comment-15254096 ] 

Susan Hinrichs commented on TS-4372:
------------------------------------

Tracked down the problem to the active_queue getting corrupted by unintentional multi-threaded activity.  I put in asserts that vc->thread was the same as the current thread when the active_queue manipulation methods were being called.  The assert went off very quickly from an Http2 call via FetchSM.

I then moved to 6.2.x, which includes TS-3612 to eliminate FetchSM; the assert no longer goes off and I don't see the heartbeat failures.  However, on that build the number of sockets grows.  I assume we are missing inactivity timeouts for client-side connections, and some clients don't initiate the close for a very long time.  I ran with Http2 disabled (and SPDY not built), so the leak occurs with Http1.x-only traffic as well.

I must move on to other things today, so I'm going to reinstall the 6.1 build and disable Http2 and SPDY to verify that they were the cause of the multi-threading.

[~bcall] if you have some spare cycles, could you review the keep_alive_queue/active_queue logic on 6.2.x?  Perhaps I messed things up with the TS-3612 integration.

> Traffic server heart beat fails with 6.1
> ----------------------------------------
>
>                 Key: TS-4372
>                 URL: https://issues.apache.org/jira/browse/TS-4372
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Cop, Manager
>            Reporter: Susan Hinrichs
>            Assignee: Susan Hinrichs
>         Attachments: ts-4372-example.pcap
>
>
> When running 6.1 in a loaded production environment, traffic_server will run for a while (30 minutes or so), then server heartbeats will start failing intermittently.  Eventually two will fail in a row, causing traffic_cop to restart traffic_server (or traffic_manager and then traffic_server; I'm still a bit unclear there).
> {code}
> traffic_cop[18078]: (test) read failed [104 'Connection reset by peer']
> {code}
> There are no particular resource limitations on the production machine in this state.  The number of open sockets is around 50-60K, which is consistent with its 5.3.x peer.  The memory usage is nowhere near the limit.  The CPU usage is high, but again not near the limit (perhaps half the total machine capacity).
> If we look at the packets exchanged on the loopback interface during this heartbeat-failing interval, we see some interesting things.  I'll attach an example pcap file.  The interesting traffic is on ports 8084 and 8083.  Traffic_cop sends a GET http://127.0.0.1:8083/synthetic.txt request to traffic_server over port 8084.  Traffic_server should proxy the request, sending GET /synthetic.txt to traffic_manager listening on port 8083.  Traffic_manager returns a 200 response with some data.  Traffic_server relays that response to traffic_cop.
> However, in the failure cases, traffic_cop sends the request and traffic_manager sends a RESET after the connection has been established and the request has been sent to it.  I'm guessing there is logic in traffic_server that closes the socket before reading the GET request, causing the reset to be sent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)