Posted to users@trafficserver.apache.org by Mateusz Zajakala <za...@gmail.com> on 2017/12/28 20:58:01 UTC

accepting large number of inbound TCP connections

Hi,

I'm trying to optimize the throughput of ATS 6.2.0 running on a 16 GB / 8
core server. ATS handles up to 7 Gbps of traffic (circa 500 requests per
second), serving up to 80% of the traffic from a ram-disk-based cache.

The problem I'm seeing is that from time to time my HTTP clients can't
connect to the server reasonably fast (I define "reasonably fast" as under
1 second to establish the TCP connection). Unfortunately, HTTP keep-alive is
not used by the clients, so those 500 requests per second are all made over
new TCP connections. Clients connect, retrieve the file and disconnect. I do
realize the overhead, but this is not something I can easily change (it's on
the client side)...

I'm wondering what I can do to improve the performance and eliminate those
failed connection attempts. Some ideas I have tried so far (settings
sketched below):
- a 30,000 connection throttle in records.config (afaik this also sets the
max number of open files for ATS)
- tcp_fin_timeout set to 1 - I have checked, and I'm not running out of
ports because of sockets stuck in TIME_WAIT; at any given time I have no
more than 1k TCP connections open
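
For reference, this is roughly what I have in place (the records.config name
is from memory, please correct me if the throttle knob is a different one):

  # records.config: the connection throttle mentioned above
  CONFIG proxy.config.net.connections_throttle INT 30000

  # kernel side (otherwise stock CentOS 7 settings):
  sysctl -w net.ipv4.tcp_fin_timeout=1
  # quick check that TIME_WAIT sockets really are not piling up:
  ss -tan state time-wait | wc -l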

Unfortunately, I'm not sure where these incoming connections get
dropped/stuck, and I'm not sure which TCP stats would help in understanding
this. I have also not tweaked the default CentOS 7 TCP settings, as I don't
feel competent enough there.

One thing that caught my attention is proxy.config.accept_threads, which is
set to 1 (the default, shown below). This seems really low given the
traffic, but I read somewhere that it's best left at that. Can you please
comment on that? Shouldn't this value be raised (e.g. to 4 or more)? Or
should the accepts even be moved to the worker threads?
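
For context, the relevant line in my records.config is just the default:

  CONFIG proxy.config.accept_threads INT 1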

I'm not seeing any meaningful errors in the ATS logs, but I have no debug
tags enabled either. Any suggestion on how to debug / improve this would be
much appreciated.

Thanks
Mateusz

Re: accepting large number of inbound TCP connections

Posted by Veiko Kukk <ve...@gmail.com>.
Hi Mateusz

When you run ab against your ATS to create a high enough artificial load,
do all requests succeed, and how quickly? Using a tool like Wireshark during
the ab test to dump and analyze the TCP traffic could then give you a hint
as to where exactly the failure happens. To exclude any network bottlenecks,
run ab locally on the server.
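
Something along these lines, for example (port and object path are of course
placeholders for whatever your ATS serves):

  # artificial load, one new connection per request (ab does not use
  # keep-alive unless you pass -k), roughly the production pattern
  ab -n 50000 -c 100 http://127.0.0.1:8080/path/to/cached/object

  # in parallel, capture only handshake/reset packets so the dump stays small
  tcpdump -i any -w handshake.pcap \
    'port 8080 and tcp[tcpflags] & (tcp-syn|tcp-rst) != 0'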

Veiko


2017-12-28 22:58 GMT+02:00 Mateusz Zajakala <za...@gmail.com>:

> [snip: original message quoted in full]

Re: accepting large number of inbound TCP connections

Posted by Leif Hedstrom <zw...@apache.org>.
In addition to all the other ideas given here, you should test 7.1.x. We have fixed many issues around origin connectivity there. Not saying it will fix this, but it’s worth a shot, and it’s where we are focusing development efforts.

— Leif 

> On Dec 29, 2017, at 12:44 PM, David Boreham <da...@bozemanpass.com> wrote:
> [snip: David's reply quoted in full; see his message below in this thread]


Re: accepting large number of inbound TCP connections

Posted by David Boreham <da...@bozemanpass.com>.
I should say that I don't know much about ATS but I have spent some time 
looking into similar problems with other servers over the years. Some 
ideas below:

On 12/29/2017 3:56 AM, Mateusz Zajakala wrote:
> CPU utilization does not exceed 40% during peak traffic. I also 
> checked the number of sockets in connection
Note that 40% aggregate CPU on a many-core system can easily hide a
saturated single thread. If under your workload the server ends up doing
much of its work in one thread, that can starve overall throughput. For
example, on your 8-core box a single thread maxing out a core would show up
as only 12.5% -- well below your observed 40%.
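
A quick way to check for that is to look at per-thread CPU rather than the
aggregate, e.g.:

  # per-thread CPU for traffic_server; one thread pinned near 100% would be
  # invisible in a 40% process-wide figure
  top -H -p $(pidof traffic_server)

  # or sampled over time (pidstat is in the sysstat package):
  pidstat -t -p $(pidof traffic_server) 5
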
> pending state (SYN_RECV) and it never goes above 20, so I suppose 
> accepting incoming connections is not the bottleneck.
>
> What about the number of worker threads? I'm using autoconfig with 
> default scale factor (1.5) which on my system (8 cores) creates 27 
> threads for traffic_server. Does it make sense to increase the scale 
> factor if my CPU utilization is not high? will this improve the 
> overall performance? What about stacksize?
>
I would recommend first gathering some data along the lines of "ok, so
what _is_ it doing?" rather than theorizing about solutions. For example,
use "pstack" or a similar tool to snapshot the ATS process' thread stacks
at full load. Take a few such samples and look at them to see what it is up
to. If you see, for example, all the threads busy doing work, that might be
good supporting evidence for making a thread pool larger. Or is the accept
thread always running (indicating that the incoming accept workload has
saturated one core)? I suspect there are various counters and such
maintained by the ATS code that can be inspected on a live server --
typically these will give you some idea of what is happening (e.g. work
queuing up waiting on threads).
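
Concretely, something like this (a handful of samples taken while the server
is under full load is usually enough):

  PID=$(pidof traffic_server)
  # snapshot the thread stacks a few times, a couple of seconds apart
  for i in 1 2 3 4 5; do
    pstack "$PID" > "ats-stacks.$i.txt"
    sleep 2
  done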

A good way to think through a problem like this is to try to imagine 
what the server should be doing under the load you have. Once you have 
that mental picture, go look at what it is actually doing and see what's 
different.
> How should I go about finding the cause of some clients occasionally not
> being able to connect?

See if you can reproduce the problem yourself with a test client (e.g.
curl/wget). If you can, good: now work to "trace" what is happening with
the packets from that client. You can use a netfilter/tcpdump filter to
target only its IP or MAC, so you isolate the traffic you want to look at
from the deluge, with low overhead. This should tell you whether the stall
is occurring at the NIC, in the kernel, or in user space. To dig into
what's going on in user space, use logging (I assume, but don't know for
sure, that ATS can be made to log the client IP). If you need more
information than the existing logging gives you, add new code to log
whatever is useful for your investigation.
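
For example (the client address is of course a placeholder):

  # capture only the traffic for the one misbehaving client, to keep the
  # overhead and capture size low
  tcpdump -i any -w one-client.pcap host 203.0.113.45 and port 80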

If you can't reproduce the issue with your own client, that's not great,
but you can attempt to work "backwards" to a reproduced case by capturing
all, or a decent sample of, the network traffic and then analyzing it
statically to find examples.



Re: accepting large number of inbound TCP connections

Posted by Mateusz Zajakala <za...@gmail.com>.
CPU utilization does not exceed 40% during peak traffic. I also checked the
number of sockets in connection pending state (SYN_RECV) and it never goes
above 20, so I suppose accepting incoming connections is not the bottleneck.
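
For what it's worth, this is how I'm counting them:

  # half-open connections still waiting for the handshake to complete
  ss -tan state syn-recv | wc -l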

What about the number of worker threads? I'm using autoconfig with the
default scale factor (1.5), which on my system (8 cores) creates 27 threads
for traffic_server. Does it make sense to increase the scale factor if my
CPU utilization is not high? Will that improve the overall performance?
What about stacksize?
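
The thread-related settings are all at their defaults, i.e. roughly (names
as I remember them from records.config):

  CONFIG proxy.config.exec_thread.autoconfig INT 1
  CONFIG proxy.config.exec_thread.autoconfig.scale FLOAT 1.5
  CONFIG proxy.config.thread.default.stacksize INT 1048576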

How should I go about finding the cause of some clients occasionally not
being able to connect?

Thanks
Mateusz

On Fri, Dec 29, 2017 at 7:53 AM, John Plevyak <jp...@gmail.com> wrote:

> [snip: John's reply and the original message quoted in full; see John's
> message below in this thread]

Re: accepting large number of inbound TCP connections

Posted by John Plevyak <jp...@gmail.com>.
What is your CPU utilization? I would think you would be mostly idle, in
which case it isn't a problem with the accept thread. The reason there is
only one accept thread is that in the past more than one has resulted in
lock contention in the OS around the single file descriptor for the accept
port, and the accept thread does nothing but accept() and queue the new
connection onto the net worker threads.



On Thu, Dec 28, 2017 at 12:58 PM, Mateusz Zajakala <za...@gmail.com>
wrote:

> [snip: original message quoted in full]