You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@trafficserver.apache.org by Hongfei Zhang <ho...@gmail.com> on 2020/09/30 18:06:39 UTC

Traffic Storm After Parent Proxy Loss & Recovery

Hi Folks,

In a 2-tier CDN production environment using ATS, we recently had a series of issues that I want to reach out to the community for help. 

The symptoms usually shows up  on edge caches as:
	1) elevated CPU - noticeable, but not 'through the roof'
	2) unusually high ingress bw, often order of magnitudes higher than egress. i.e. edge caches is pulling a lot of data from mids but not sending much to end clients. We often saw 3-4Gbps ingress vs. 500Mbps egress. The edge cache uses separate network interfaces - the ingress traffic is mainly from upstream interface. 
	3) huge number of TIME_WAIT sockets on edge cache, 70-80K, all of them destined to mid. This number stays at that level until we kill/restart ATS process. It looks like the edge is open/close sockets to mid, as indicated by the debug log. 
	4) debug log shows SM 702933367 stuck in the loop repeating the following, over and over and over. The regular transaction log, however, does not have an entry for the url this SM is handling.  So it feels like the the stuck loop runs in the ‘background’, GET the same url, without caching or sending to client. 
...
[Sep 22 03:54:23.135] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] State Transition: SM_ACTION_ORIGIN_SERVER_OPEN -> SM_ACTION_ORIGIN_SERVER_OPEN
[Sep 22 03:54:23.135] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] open connection to x.x.x.x 172.17.173.103:80
[Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, NET_EVENT_OPEN]
[Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_http_server_open, NET_EVENT_OPEN]
[Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling plugin on hook TS_HTTP_SEND_REQUEST_HDR_HOOK at hook 0x2b7e844b49e0
[Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callback, HTTP_API_CONTINUE]
[Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callout, HTTP_API_CONTINUE]
[Sep 22 03:54:23.187] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, VC_EVENT_WRITE_COMPLETE]
[Sep 22 03:54:23.187] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE]
[Sep 22 03:54:23.226] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, VC_EVENT_READ_READY]
[Sep 22 03:54:23.226] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_read_server_response_header, VC_EVENT_READ_READY]
[Sep 22 03:54:23.227] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling plugin on hook TS_HTTP_READ_RESPONSE_HDR_HOOK at hook 0x2b7e844b4a60
[Sep 22 03:54:23.231] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callback, HTTP_API_CONTINUE]
[Sep 22 03:54:23.232] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callout, HTTP_API_CONTINUE]
[Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [HttpTransact::HandleResponse]
[Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] [HttpTransact::HandleResponse] Response received
[Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [HandleResponse] response_received_time: 1600746863
[Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [is_response_valid] No errors in response
[Sep 22 03:54:23.234] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] [HttpTransact::HandleResponse] Response valid
[Sep 22 03:54:23.235] Server {0x2b7e51d2d700} DEBUG: (http_hdrs) [702933367] [initialize_state_variables_from_response]Server is keep-alive.
[Sep 22 03:54:23.236] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] State Transition: SM_ACTION_ORIGIN_SERVER_OPEN -> SM_ACTION_ORIGIN_SERVER_OPEN
[Sep 22 03:54:23.240] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] open connection to x.x.x.x 172.17.173.103:80
[Sep 22 03:54:23.241] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, NET_EVENT_OPEN]
[Sep 22 03:54:23.242] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_http_server_open, NET_EVENT_OPEN]
[Sep 22 03:54:23.242] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling plugin on hook TS_HTTP_SEND_REQUEST_HDR_HOOK at hook 0x2b7e844b49e0
[Sep 22 03:54:23.243] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callback, HTTP_API_CONTINUE]
[Sep 22 03:54:23.244] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callout, HTTP_API_CONTINUE]
[Sep 22 03:54:23.281] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, VC_EVENT_WRITE_COMPLETE]
[Sep 22 03:54:23.281] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE]
[Sep 22 03:54:23.285] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, VC_EVENT_READ_READY]
[Sep 22 03:54:23.286] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_read_server_response_header, VC_EVENT_READ_READY]
[Sep 22 03:54:23.288] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling plugin on hook TS_HTTP_READ_RESPONSE_HDR_HOOK at hook 0x2b7e844b4a60
[Sep 22 03:54:23.289] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callback, HTTP_API_CONTINUE]
[Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callout, HTTP_API_CONTINUE]
[Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [HttpTransact::HandleResponse]
[Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] [HttpTransact::HandleResponse] Response received
[Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [HandleResponse] response_received_time: 1600746863
[Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [is_response_valid] No errors in response
[Sep 22 03:54:23.292] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] [HttpTransact::HandleResponse] Response valid
[Sep 22 03:54:23.292] Server {0x2b7e51d2d700} DEBUG: (http_hdrs) [702933367] [initialize_state_variables_from_response]Server is keep-alive.
…
	5) the transaction log on the mid confirms there were request storms for the same url from a the edge, but all resulted in crc=ERR_CLIENT_ABORT, cfsc=INTR. 


Our initial investigation confirmed the this traffic storm situation usually follows a repeated series of parent proxy (mid tier cache) loss and recovery events - recovery is usually 10 seconds after initial mark down, and the loss/recover repeats 3 or 4 times in about a minute. 

The ATS is 7.1.5 based, process running on bare metal server with Centos 7.7. The only way to recover today is to restart ATS process. Obviously this is service impacting.

Was this problem reported before? Any verdict on root cause? Debugging tip or pointers to resolution?

Thanks,
-Hongfei 

Re: Traffic Storm After Parent Proxy Loss & Recovery

Posted by Evan Zelkowitz <ez...@apache.org>.
Yes, the first thing that helped us with connection issues was setting
the origin_max_connections to a reasonable value, either per cache or
per remap depending on traffic but we would always have a cache upper
limit. However in a 2-tier setup if you only do this at one tier you
back up the connections to another tier so it has to be done at both
tiers. Even with that though you can still end up with the close wait
issues with connections hanging around which is where the patch came
in because that addresses that tight spinning loop you were seeing

On Wed, Sep 30, 2020 at 1:43 PM Hongfei Zhang <ho...@gmail.com> wrote:
>
> Thank Evan. In our case, the edge’s
> origin_max_connections: 0
> net.connection_throttle:500000
> parent_proxy.total_connect_attemps:6
> parent_proxy.per_parent_connec_attempts:1
>
> Do you think we are subject to the looping condition this commit addresses?
>
> -Hongfei
>
> > On Sep 30, 2020, at 2:37 PM, Evan Zelkowitz <ez...@apache.org> wrote:
> >
> > This sounds similar to issues we have had in the past. I think this
> > was fixed in 7x with this commit,
> > https://github.com/apache/trafficserver/commit/1da4fbcd5fd32ff86888ed401e0c297b22e00c52
> > . There have been a few other commits past that for other tight loop
> > scenarios when dealing with the connection limit in ATS as well so I'd
> > recommend moving to at least 7.1.8
> >
> > On Wed, Sep 30, 2020 at 12:07 PM Hongfei Zhang <ho...@gmail.com> wrote:
> >>
> >> Hi Folks,
> >>
> >> In a 2-tier CDN production environment using ATS, we recently had a series of issues that I want to reach out to the community for help.
> >>
> >> The symptoms usually shows up  on edge caches as:
> >>        1) elevated CPU - noticeable, but not 'through the roof'
> >>        2) unusually high ingress bw, often order of magnitudes higher than egress. i.e. edge caches is pulling a lot of data from mids but not sending much to end clients. We often saw 3-4Gbps ingress vs. 500Mbps egress. The edge cache uses separate network interfaces - the ingress traffic is mainly from upstream interface.
> >>        3) huge number of TIME_WAIT sockets on edge cache, 70-80K, all of them destined to mid. This number stays at that level until we kill/restart ATS process. It looks like the edge is open/close sockets to mid, as indicated by the debug log.
> >>        4) debug log shows SM 702933367 stuck in the loop repeating the following, over and over and over. The regular transaction log, however, does not have an entry for the url this SM is handling.  So it feels like the the stuck loop runs in the ‘background’, GET the same url, without caching or sending to client.
> >> ...
> >> [Sep 22 03:54:23.135] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] State Transition: SM_ACTION_ORIGIN_SERVER_OPEN -> SM_ACTION_ORIGIN_SERVER_OPEN
> >> [Sep 22 03:54:23.135] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] open connection to x.x.x.x 172.17.173.103:80
> >> [Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, NET_EVENT_OPEN]
> >> [Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_http_server_open, NET_EVENT_OPEN]
> >> [Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling plugin on hook TS_HTTP_SEND_REQUEST_HDR_HOOK at hook 0x2b7e844b49e0
> >> [Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callback, HTTP_API_CONTINUE]
> >> [Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callout, HTTP_API_CONTINUE]
> >> [Sep 22 03:54:23.187] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, VC_EVENT_WRITE_COMPLETE]
> >> [Sep 22 03:54:23.187] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE]
> >> [Sep 22 03:54:23.226] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, VC_EVENT_READ_READY]
> >> [Sep 22 03:54:23.226] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_read_server_response_header, VC_EVENT_READ_READY]
> >> [Sep 22 03:54:23.227] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling plugin on hook TS_HTTP_READ_RESPONSE_HDR_HOOK at hook 0x2b7e844b4a60
> >> [Sep 22 03:54:23.231] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callback, HTTP_API_CONTINUE]
> >> [Sep 22 03:54:23.232] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callout, HTTP_API_CONTINUE]
> >> [Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [HttpTransact::HandleResponse]
> >> [Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] [HttpTransact::HandleResponse] Response received
> >> [Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [HandleResponse] response_received_time: 1600746863
> >> [Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [is_response_valid] No errors in response
> >> [Sep 22 03:54:23.234] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] [HttpTransact::HandleResponse] Response valid
> >> [Sep 22 03:54:23.235] Server {0x2b7e51d2d700} DEBUG: (http_hdrs) [702933367] [initialize_state_variables_from_response]Server is keep-alive.
> >> [Sep 22 03:54:23.236] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] State Transition: SM_ACTION_ORIGIN_SERVER_OPEN -> SM_ACTION_ORIGIN_SERVER_OPEN
> >> [Sep 22 03:54:23.240] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] open connection to x.x.x.x 172.17.173.103:80
> >> [Sep 22 03:54:23.241] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, NET_EVENT_OPEN]
> >> [Sep 22 03:54:23.242] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_http_server_open, NET_EVENT_OPEN]
> >> [Sep 22 03:54:23.242] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling plugin on hook TS_HTTP_SEND_REQUEST_HDR_HOOK at hook 0x2b7e844b49e0
> >> [Sep 22 03:54:23.243] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callback, HTTP_API_CONTINUE]
> >> [Sep 22 03:54:23.244] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callout, HTTP_API_CONTINUE]
> >> [Sep 22 03:54:23.281] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, VC_EVENT_WRITE_COMPLETE]
> >> [Sep 22 03:54:23.281] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE]
> >> [Sep 22 03:54:23.285] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, VC_EVENT_READ_READY]
> >> [Sep 22 03:54:23.286] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_read_server_response_header, VC_EVENT_READ_READY]
> >> [Sep 22 03:54:23.288] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling plugin on hook TS_HTTP_READ_RESPONSE_HDR_HOOK at hook 0x2b7e844b4a60
> >> [Sep 22 03:54:23.289] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callback, HTTP_API_CONTINUE]
> >> [Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callout, HTTP_API_CONTINUE]
> >> [Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [HttpTransact::HandleResponse]
> >> [Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] [HttpTransact::HandleResponse] Response received
> >> [Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [HandleResponse] response_received_time: 1600746863
> >> [Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [is_response_valid] No errors in response
> >> [Sep 22 03:54:23.292] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] [HttpTransact::HandleResponse] Response valid
> >> [Sep 22 03:54:23.292] Server {0x2b7e51d2d700} DEBUG: (http_hdrs) [702933367] [initialize_state_variables_from_response]Server is keep-alive.
> >> …
> >>        5) the transaction log on the mid confirms there were request storms for the same url from a the edge, but all resulted in crc=ERR_CLIENT_ABORT, cfsc=INTR.
> >>
> >>
> >> Our initial investigation confirmed the this traffic storm situation usually follows a repeated series of parent proxy (mid tier cache) loss and recovery events - recovery is usually 10 seconds after initial mark down, and the loss/recover repeats 3 or 4 times in about a minute.
> >>
> >> The ATS is 7.1.5 based, process running on bare metal server with Centos 7.7. The only way to recover today is to restart ATS process. Obviously this is service impacting.
> >>
> >> Was this problem reported before? Any verdict on root cause? Debugging tip or pointers to resolution?
> >>
> >> Thanks,
> >> -Hongfei
>

Re: Traffic Storm After Parent Proxy Loss & Recovery

Posted by Hongfei Zhang <ho...@gmail.com>.
Thank Evan. In our case, the edge’s
origin_max_connections: 0 
net.connection_throttle:500000
parent_proxy.total_connect_attemps:6
parent_proxy.per_parent_connec_attempts:1 

Do you think we are subject to the looping condition this commit addresses?

-Hongfei

> On Sep 30, 2020, at 2:37 PM, Evan Zelkowitz <ez...@apache.org> wrote:
> 
> This sounds similar to issues we have had in the past. I think this
> was fixed in 7x with this commit,
> https://github.com/apache/trafficserver/commit/1da4fbcd5fd32ff86888ed401e0c297b22e00c52
> . There have been a few other commits past that for other tight loop
> scenarios when dealing with the connection limit in ATS as well so I'd
> recommend moving to at least 7.1.8
> 
> On Wed, Sep 30, 2020 at 12:07 PM Hongfei Zhang <ho...@gmail.com> wrote:
>> 
>> Hi Folks,
>> 
>> In a 2-tier CDN production environment using ATS, we recently had a series of issues that I want to reach out to the community for help.
>> 
>> The symptoms usually shows up  on edge caches as:
>>        1) elevated CPU - noticeable, but not 'through the roof'
>>        2) unusually high ingress bw, often order of magnitudes higher than egress. i.e. edge caches is pulling a lot of data from mids but not sending much to end clients. We often saw 3-4Gbps ingress vs. 500Mbps egress. The edge cache uses separate network interfaces - the ingress traffic is mainly from upstream interface.
>>        3) huge number of TIME_WAIT sockets on edge cache, 70-80K, all of them destined to mid. This number stays at that level until we kill/restart ATS process. It looks like the edge is open/close sockets to mid, as indicated by the debug log.
>>        4) debug log shows SM 702933367 stuck in the loop repeating the following, over and over and over. The regular transaction log, however, does not have an entry for the url this SM is handling.  So it feels like the the stuck loop runs in the ‘background’, GET the same url, without caching or sending to client.
>> ...
>> [Sep 22 03:54:23.135] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] State Transition: SM_ACTION_ORIGIN_SERVER_OPEN -> SM_ACTION_ORIGIN_SERVER_OPEN
>> [Sep 22 03:54:23.135] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] open connection to x.x.x.x 172.17.173.103:80
>> [Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, NET_EVENT_OPEN]
>> [Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_http_server_open, NET_EVENT_OPEN]
>> [Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling plugin on hook TS_HTTP_SEND_REQUEST_HDR_HOOK at hook 0x2b7e844b49e0
>> [Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callback, HTTP_API_CONTINUE]
>> [Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callout, HTTP_API_CONTINUE]
>> [Sep 22 03:54:23.187] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, VC_EVENT_WRITE_COMPLETE]
>> [Sep 22 03:54:23.187] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE]
>> [Sep 22 03:54:23.226] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, VC_EVENT_READ_READY]
>> [Sep 22 03:54:23.226] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_read_server_response_header, VC_EVENT_READ_READY]
>> [Sep 22 03:54:23.227] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling plugin on hook TS_HTTP_READ_RESPONSE_HDR_HOOK at hook 0x2b7e844b4a60
>> [Sep 22 03:54:23.231] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callback, HTTP_API_CONTINUE]
>> [Sep 22 03:54:23.232] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callout, HTTP_API_CONTINUE]
>> [Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [HttpTransact::HandleResponse]
>> [Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] [HttpTransact::HandleResponse] Response received
>> [Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [HandleResponse] response_received_time: 1600746863
>> [Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [is_response_valid] No errors in response
>> [Sep 22 03:54:23.234] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] [HttpTransact::HandleResponse] Response valid
>> [Sep 22 03:54:23.235] Server {0x2b7e51d2d700} DEBUG: (http_hdrs) [702933367] [initialize_state_variables_from_response]Server is keep-alive.
>> [Sep 22 03:54:23.236] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] State Transition: SM_ACTION_ORIGIN_SERVER_OPEN -> SM_ACTION_ORIGIN_SERVER_OPEN
>> [Sep 22 03:54:23.240] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] open connection to x.x.x.x 172.17.173.103:80
>> [Sep 22 03:54:23.241] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, NET_EVENT_OPEN]
>> [Sep 22 03:54:23.242] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_http_server_open, NET_EVENT_OPEN]
>> [Sep 22 03:54:23.242] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling plugin on hook TS_HTTP_SEND_REQUEST_HDR_HOOK at hook 0x2b7e844b49e0
>> [Sep 22 03:54:23.243] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callback, HTTP_API_CONTINUE]
>> [Sep 22 03:54:23.244] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callout, HTTP_API_CONTINUE]
>> [Sep 22 03:54:23.281] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, VC_EVENT_WRITE_COMPLETE]
>> [Sep 22 03:54:23.281] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE]
>> [Sep 22 03:54:23.285] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, VC_EVENT_READ_READY]
>> [Sep 22 03:54:23.286] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_read_server_response_header, VC_EVENT_READ_READY]
>> [Sep 22 03:54:23.288] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling plugin on hook TS_HTTP_READ_RESPONSE_HDR_HOOK at hook 0x2b7e844b4a60
>> [Sep 22 03:54:23.289] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callback, HTTP_API_CONTINUE]
>> [Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callout, HTTP_API_CONTINUE]
>> [Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [HttpTransact::HandleResponse]
>> [Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] [HttpTransact::HandleResponse] Response received
>> [Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [HandleResponse] response_received_time: 1600746863
>> [Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [is_response_valid] No errors in response
>> [Sep 22 03:54:23.292] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] [HttpTransact::HandleResponse] Response valid
>> [Sep 22 03:54:23.292] Server {0x2b7e51d2d700} DEBUG: (http_hdrs) [702933367] [initialize_state_variables_from_response]Server is keep-alive.
>> …
>>        5) the transaction log on the mid confirms there were request storms for the same url from a the edge, but all resulted in crc=ERR_CLIENT_ABORT, cfsc=INTR.
>> 
>> 
>> Our initial investigation confirmed the this traffic storm situation usually follows a repeated series of parent proxy (mid tier cache) loss and recovery events - recovery is usually 10 seconds after initial mark down, and the loss/recover repeats 3 or 4 times in about a minute.
>> 
>> The ATS is 7.1.5 based, process running on bare metal server with Centos 7.7. The only way to recover today is to restart ATS process. Obviously this is service impacting.
>> 
>> Was this problem reported before? Any verdict on root cause? Debugging tip or pointers to resolution?
>> 
>> Thanks,
>> -Hongfei


Re: Traffic Storm After Parent Proxy Loss & Recovery

Posted by Evan Zelkowitz <ez...@apache.org>.
This sounds similar to issues we have had in the past. I think this
was fixed in 7x with this commit,
https://github.com/apache/trafficserver/commit/1da4fbcd5fd32ff86888ed401e0c297b22e00c52
. There have been a few other commits past that for other tight loop
scenarios when dealing with the connection limit in ATS as well so I'd
recommend moving to at least 7.1.8

On Wed, Sep 30, 2020 at 12:07 PM Hongfei Zhang <ho...@gmail.com> wrote:
>
> Hi Folks,
>
> In a 2-tier CDN production environment using ATS, we recently had a series of issues that I want to reach out to the community for help.
>
> The symptoms usually shows up  on edge caches as:
>         1) elevated CPU - noticeable, but not 'through the roof'
>         2) unusually high ingress bw, often order of magnitudes higher than egress. i.e. edge caches is pulling a lot of data from mids but not sending much to end clients. We often saw 3-4Gbps ingress vs. 500Mbps egress. The edge cache uses separate network interfaces - the ingress traffic is mainly from upstream interface.
>         3) huge number of TIME_WAIT sockets on edge cache, 70-80K, all of them destined to mid. This number stays at that level until we kill/restart ATS process. It looks like the edge is open/close sockets to mid, as indicated by the debug log.
>         4) debug log shows SM 702933367 stuck in the loop repeating the following, over and over and over. The regular transaction log, however, does not have an entry for the url this SM is handling.  So it feels like the the stuck loop runs in the ‘background’, GET the same url, without caching or sending to client.
> ...
> [Sep 22 03:54:23.135] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] State Transition: SM_ACTION_ORIGIN_SERVER_OPEN -> SM_ACTION_ORIGIN_SERVER_OPEN
> [Sep 22 03:54:23.135] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] open connection to x.x.x.x 172.17.173.103:80
> [Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, NET_EVENT_OPEN]
> [Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_http_server_open, NET_EVENT_OPEN]
> [Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling plugin on hook TS_HTTP_SEND_REQUEST_HDR_HOOK at hook 0x2b7e844b49e0
> [Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callback, HTTP_API_CONTINUE]
> [Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callout, HTTP_API_CONTINUE]
> [Sep 22 03:54:23.187] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, VC_EVENT_WRITE_COMPLETE]
> [Sep 22 03:54:23.187] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE]
> [Sep 22 03:54:23.226] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, VC_EVENT_READ_READY]
> [Sep 22 03:54:23.226] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_read_server_response_header, VC_EVENT_READ_READY]
> [Sep 22 03:54:23.227] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling plugin on hook TS_HTTP_READ_RESPONSE_HDR_HOOK at hook 0x2b7e844b4a60
> [Sep 22 03:54:23.231] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callback, HTTP_API_CONTINUE]
> [Sep 22 03:54:23.232] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callout, HTTP_API_CONTINUE]
> [Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [HttpTransact::HandleResponse]
> [Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] [HttpTransact::HandleResponse] Response received
> [Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [HandleResponse] response_received_time: 1600746863
> [Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [is_response_valid] No errors in response
> [Sep 22 03:54:23.234] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] [HttpTransact::HandleResponse] Response valid
> [Sep 22 03:54:23.235] Server {0x2b7e51d2d700} DEBUG: (http_hdrs) [702933367] [initialize_state_variables_from_response]Server is keep-alive.
> [Sep 22 03:54:23.236] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] State Transition: SM_ACTION_ORIGIN_SERVER_OPEN -> SM_ACTION_ORIGIN_SERVER_OPEN
> [Sep 22 03:54:23.240] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] open connection to x.x.x.x 172.17.173.103:80
> [Sep 22 03:54:23.241] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, NET_EVENT_OPEN]
> [Sep 22 03:54:23.242] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_http_server_open, NET_EVENT_OPEN]
> [Sep 22 03:54:23.242] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling plugin on hook TS_HTTP_SEND_REQUEST_HDR_HOOK at hook 0x2b7e844b49e0
> [Sep 22 03:54:23.243] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callback, HTTP_API_CONTINUE]
> [Sep 22 03:54:23.244] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callout, HTTP_API_CONTINUE]
> [Sep 22 03:54:23.281] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, VC_EVENT_WRITE_COMPLETE]
> [Sep 22 03:54:23.281] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE]
> [Sep 22 03:54:23.285] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, VC_EVENT_READ_READY]
> [Sep 22 03:54:23.286] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_read_server_response_header, VC_EVENT_READ_READY]
> [Sep 22 03:54:23.288] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling plugin on hook TS_HTTP_READ_RESPONSE_HDR_HOOK at hook 0x2b7e844b4a60
> [Sep 22 03:54:23.289] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callback, HTTP_API_CONTINUE]
> [Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callout, HTTP_API_CONTINUE]
> [Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [HttpTransact::HandleResponse]
> [Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] [HttpTransact::HandleResponse] Response received
> [Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [HandleResponse] response_received_time: 1600746863
> [Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [is_response_valid] No errors in response
> [Sep 22 03:54:23.292] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] [HttpTransact::HandleResponse] Response valid
> [Sep 22 03:54:23.292] Server {0x2b7e51d2d700} DEBUG: (http_hdrs) [702933367] [initialize_state_variables_from_response]Server is keep-alive.
> …
>         5) the transaction log on the mid confirms there were request storms for the same url from a the edge, but all resulted in crc=ERR_CLIENT_ABORT, cfsc=INTR.
>
>
> Our initial investigation confirmed the this traffic storm situation usually follows a repeated series of parent proxy (mid tier cache) loss and recovery events - recovery is usually 10 seconds after initial mark down, and the loss/recover repeats 3 or 4 times in about a minute.
>
> The ATS is 7.1.5 based, process running on bare metal server with Centos 7.7. The only way to recover today is to restart ATS process. Obviously this is service impacting.
>
> Was this problem reported before? Any verdict on root cause? Debugging tip or pointers to resolution?
>
> Thanks,
> -Hongfei