Posted to users@trafficserver.apache.org by Pavel Kazlenka <pa...@measurement-factory.com> on 2013/10/16 18:10:34 UTC

Tuning ATS for performance testing.

Hi gentlemen,

I'm trying to test the performance of ATS v.4.0.2.

Server under test has a quad-core CPU with HT disabled. During the test 
(1k user-agents, 1k origin servers, up to 6k requests per second with an 
average size of 8 KB), at the 2-2.5k requests per second mark I see 
signs of overload (growing delay time, missed responses). The problem 
is that according to top output, the CPU is not under heavy load 
(which is strange for an overloaded system). All the other resources (RAM, 
I/O, network) are far from saturation too. Top shows load at about 
50-60% of one core for the [ET_NET 0] process. traffic_server threads seem 
to be spread across all the cores, even when I try to pin them 
to one or two of the cores using taskset.

My alterations to the default ATS configuration (mostly following this 
guide: http://www.ogre.com/node/392):

Cache is fully disabled:
CONFIG proxy.config.http.cache.http INT 0
Threads:
CONFIG proxy.config.exec_thread.autoconfig INT 0
CONFIG proxy.config.exec_thread.autoconfig.scale FLOAT 1
CONFIG proxy.config.exec_thread.limit INT 4
CONFIG proxy.config.accept_threads INT 2
CONFIG proxy.config.cache.threads_per_disk INT 1
CONFIG proxy.config.task_threads INT 4

So my questions are:
1) Is there any known strategy for distributing ATS processes/threads 
across CPU cores? E.g., bind all traffic_server threads to cpu0 and cpu1, 
all traffic_manager threads to cpu2, and networking interrupts to cpu3? 
(A rough sketch of this layout follows after the list.)
2) If so, how can this be done? I see some threads ignore 'taskset -a -p 
1,2 <traffic_server pid>' and run on any CPU core. Perhaps via 
configuration directives?
3) What is the best strategy for core configuration? Should the sum of 
task, accept, and network threads equal the number of CPU cores + 1? Or 
something else? Maybe it's better to use 40 threads in total on a 
quad-core device?
4) Are the *thread* config options taken into account if 
proxy.config.http.cache.http is set to '1'?
5) What other options influence system performance in a cache-off test?
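
To illustrate question 1, here is the kind of layout I have in mind, as 
a rough sketch (assuming Linux and util-linux taskset; <nic_irq> is a 
placeholder for the NIC's IRQ number):

	# pin traffic_server to cpu0-1 and traffic_manager to cpu2
	taskset -a -c -p 0,1 $(pidof traffic_server)
	taskset -a -c -p 2 $(pidof traffic_manager)
	# steer NIC interrupts to cpu3 (hex mask 8 = cpu3)
	echo 8 > /proc/irq/<nic_irq>/smp_affinity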

TIA,
Pavel



Re: Tuning ATS for performance testing.

Posted by Pavel Kazlenka <pa...@measurement-factory.com>.
On 10/22/2013 07:36 PM, James Peach wrote:
> On Oct 22, 2013, at 6:17 AM, Pavel Kazlenka <pa...@measurement-factory.com> wrote:
>
>> Thank you all for your replies. I have new questions here.
>>
>> I'm trying to estimate the performance of a single ATS 'network' (main) thread (I hope I'm using the correct term). My thread-related configuration:
>> CONFIG proxy.config.exec_thread.autoconfig INT 0
>> CONFIG proxy.config.exec_thread.autoconfig.scale FLOAT 1.000000
>> CONFIG proxy.config.exec_thread.limit INT 1
>> CONFIG proxy.config.accept_threads INT 1
>> CONFIG proxy.config.cache.threads_per_disk INT 0
>> CONFIG proxy.config.ssl.number.threads INT 0
>> CONFIG proxy.config.task_threads INT 1
>> #Caching is off:
>> CONFIG proxy.config.http.cache.http INT 0
>>
>> I've found that memory-related options have a great impact on performance:
>>
>> CONFIG proxy.config.thread.default.stacksize INT 536870912
> 512MB stack size per thread? Wow, that seems like an awful lot.
As this is performance testing, I'd like to minimize the performance 
impact of increasing/decreasing the stack size during the test (if there 
is any). Is there any way to obtain the current stack size value via 
procfs or the command line?
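
The closest I can think of, assuming Linux procfs, is the rlimit view 
below, though it would not reflect stacks that ATS sizes explicitly 
through pthread attributes:

	grep -i stack /proc/$(pidof traffic_server)/limits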

>> CONFIG proxy.config.allocator.thread_freelist_size INT 524288
>> CONFIG proxy.config.system.mmap_max INT 5368709120
> This setting ends up calling mallopt(M_MMAP_MAX), see <http://man7.org/linux/man-pages/man3/mallopt.3.html>. From the description in the man page, I'd be surprised if setting this was useful in most cases.
>
>> My problem is that the default values for the variables given above are extremely conservative. On the other hand, it's hard to pick good values from such a large range, especially since these options are not well documented.
>> So my questions are:
>> 1) What would be good values for the three variables above on a machine with 6GB RAM (assuming the machine's only purpose is running a single ATS thread to forward traffic at a high rate with minimal delay)?
>> 2) Is there any guide/detailed documentation on these options?
>> 3) Maybe there's some kind of common-sense formula that could help choose the values based on proxy load (requests per second)?
> proxy.config.allocator.thread_freelist_size seems like a reasonable setting to tune. This is going to control how much memory will be permanently allocated to per-thread magazines. It's a tuning balance between how much memory you need to allocate to servicing transactions and how much memory should be used for RAM cache and other processes that are running on the box.
>
> FWIW I run my systems with proxy.config.allocator.thread_freelist_size=16K, though my workload and hardware configuration is pretty different from yours.
>
>> Device under test is Ubuntu 12.04 LTS, 64-bit, 6GB RAM; expected load is up to 15k requests per second.
>>
>> TIA,
>> Pavel
>>
>> On 10/17/2013 10:07 PM, Igor Galić wrote:
>>> ----- Original Message -----
>>>> Thank you Igor.
>>>>
>>>> I've rebuilt ATS with hwloc and things became a bit better. Now I see
>>>> that load is being balanced fairly across the configured number of threads
>>> Thank you very much for this feedback.
>>> I knew that these code paths have an impact (I've hacked bits of them too)
>>> but since I always compile --with-hwloc, I never saw the difference.
>>>
>>>
>>> ++i
>>>
>>> Igor Galić
>>>
>>> Tel: +43 (0) 664 886 22 883
>>> Mail: i.galic@brainsware.org
>>> URL: http://brainsware.org/
>>> GPG: 6880 4155 74BD FD7C B515  2EA5 4B1D 9E08 A097 C9AE
>>>


Re: Tuning ATS for performance testing.

Posted by Pavel Kazlenka <pa...@measurement-factory.com>.
On 10/23/2013 07:22 AM, Leif Hedstrom wrote:
> On Oct 22, 2013, at 10:36 AM, James Peach <jp...@apache.org> wrote:
>
>>>
>>> I've found that memory-related options have a great impact on performance:
>>>
>>> CONFIG proxy.config.thread.default.stacksize INT 536870912
>> 512MB stack size per thread? Wow, that seems like an awful lot.
> The history on this setting is long and strange. The defaults were set arbitrarily high to avoid a problem that looked like an out of memory event (on the cache). Increasing this setting also came together with increasing the appropriate sysctl:
>
> 	vm.max_map_count = 2097152
>
>
> It might be time to revisit this again. I filed a Jira on this a while back:
>
> 	https://issues.apache.org/jira/browse/TS-1822
>
> As to why this would reduce lock contention, I have no idea. I'm positive it was added purely to address a false positive "out of memory" problem. Bryan, can you look this up in Bugzilla? Probably from around 1997 or so, by Vladimir.
>
I suspect you are talking about proxy.config.system.mmap_max. Is it safe 
to set the ATS config value and the kernel setting to something like 
RAM * vm.swappiness? Could this cause performance problems or overall 
system instability?
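
A quick way to watch both sides of this on Linux (vm.max_map_count caps 
the number of memory mappings a process may hold, and /proc/<pid>/maps 
lists the current ones):

	sysctl vm.max_map_count
	wc -l /proc/$(pidof traffic_server)/maps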

>>> CONFIG proxy.config.allocator.thread_freelist_size INT 524288
> This is incredibly high, but probably fine. The default is 512 objects for each freelist for each thread. We probably should allow for a "-1" to mean unlimited (which the above basically means).
I don't want to be annoying, but can you (or anybody else) shed a bit 
more light on the freelist entity itself for a non-developer (i.e., 
someone who cannot dig into the code)? E.g.: 'Each thread has its own 
list of objects (responses and requests; what else?) that the thread is 
currently processing. If the freelist is full at the moment a 
request/reply arrives, the request is ... (dropped, stored in an 
intermediate place, postponed?). The value for this option should be 
chosen based on the average (maximal?) request rate (x2 for request and 
response?) and the amount of available memory (a thread will consume 
about freelist_size * average_response_size of memory?)...'. Is there 
any way to get at least a rough current count of objects in the 
freelists (per thread or for all threads) during ATS execution?
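
For scale, under the rough (unverified) assumption that each freelist 
object is on the order of this test's 8 KB responses, 524288 objects 
per thread would pin about 524288 * 8 KB = 4 GiB per thread, which 
would explain why the default is only 512.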

> Cheers,
>
> -- Leif
>
>


Re: Tuning ATS for performance testing.

Posted by Leif Hedstrom <zw...@apache.org>.
On Oct 22, 2013, at 10:36 AM, James Peach <jp...@apache.org> wrote:

> 
>> 
>> 
>> I've found that memory-related options have a great impact on performance:
>> 
>> CONFIG proxy.config.thread.default.stacksize INT 536870912
> 
> 512MB stack size per thread? Wow, that seems like an awful lot.

The history on this setting is long and strange. The defaults were set arbitrarily high to avoid a problem that looked like an out of memory event (on the cache). Increasing this setting also came together with increasing the appropriate sysctl:

	vm.max_map_count = 2097152
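
(Applied at runtime with 'sysctl -w vm.max_map_count=2097152', or 
persisted through /etc/sysctl.conf.)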


It might be time to revisit this again. I filed a Jira on this a while back:

	https://issues.apache.org/jira/browse/TS-1822

As to why this would reduce lock contention, I have no idea. I'm positive it was added purely to address a false positive "out of memory" problem. Bryan, can you look this up in Bugzilla? Probably from around 1997 or so, by Vladimir.

> 
>> CONFIG proxy.config.allocator.thread_freelist_size INT 524288

This is incredibly high, but probably fine. The default is 512 objects for each freelist for each thread. We probably should allow for a "-1" to mean unlimited (which the above basically means).

Cheers,

-- Leif



Re: Tuning ATS for performance testing.

Posted by "Adam W. Dace" <co...@gmail.com>.
Yeah, I've just been trying to optimize what I have without digging into
the codebase too much.

Just FYI, I eventually went with 8MB for proxy.config.system.mmap_max.
Anything higher seemed to affect latency (i.e., page load speed).

So I'm finally done tuning my child/parent proxy setup.  Will move on to
testing file sizes soonish.


On Tue, Oct 22, 2013 at 10:49 PM, James Peach <jp...@apache.org> wrote:

> On Oct 22, 2013, at 1:45 PM, Adam W. Dace <co...@gmail.com>
> wrote:
>
> > Just thought I'd jump in here and report that after seeing his email,
> just for fun, I tried increasing proxy.config.system.mmap_max
> > from 2MB (default) to 8MB... then 16MB (my disk can read/write about
> 50MB/sec, so this seems reasonable).
>
> That's really interesting. You could also test using tcmalloc, which has a
> reputation for higher performance than glibc's malloc.
>
> >
> > All of a sudden my Bing Image Searches aren't running into a write lock
> contention issue anymore.
> > I just may end up keeping this setting.  :)
> >
> >
> >
> > On Tue, Oct 22, 2013 at 11:36 AM, James Peach <jp...@apache.org> wrote:
> >
> > On Oct 22, 2013, at 6:17 AM, Pavel Kazlenka <
> pavel.kazlenka@measurement-factory.com> wrote:
> >
> > > Thank you all for your replies. I have new questions here.
> > >
> > > I'm trying to estimate the performance of a single ATS 'network' (main)
> thread (I hope I'm using the correct term). My thread-related configuration:
> > > CONFIG proxy.config.exec_thread.autoconfig INT 0
> > > CONFIG proxy.config.exec_thread.autoconfig.scale FLOAT 1.000000
> > > CONFIG proxy.config.exec_thread.limit INT 1
> > > CONFIG proxy.config.accept_threads INT 1
> > > CONFIG proxy.config.cache.threads_per_disk INT 0
> > > CONFIG proxy.config.ssl.number.threads INT 0
> > > CONFIG proxy.config.task_threads INT 1
> > > #Caching is off:
> > > CONFIG proxy.config.http.cache.http INT 0
> > >
> > > I've found that memory-related options have a great impact on
> performance:
> > >
> > > CONFIG proxy.config.thread.default.stacksize INT 536870912
> >
> > 512MB stack size per thread? Wow, that seems like an awful lot.
> >
> > > CONFIG proxy.config.allocator.thread_freelist_size INT 524288
> > > CONFIG proxy.config.system.mmap_max INT 5368709120
> >
> > This setting ends up calling mallopt(M_MMAP_MAX), see <
> http://man7.org/linux/man-pages/man3/mallopt.3.html>. From the
> description in the man page, I'd be surprised if setting this was useful in
> most cases.
> >
> > >
> > > My problem is that the default values for the variables given above are
> extremely conservative. On the other hand, it's hard to pick good
> values from such a large range, especially since these options are not
> well documented.
> > > So my questions are:
> > > 1) What would be good values for the three variables above on a machine
> with 6GB RAM (assuming the machine's only purpose is running a single ATS
> thread to forward traffic at a high rate with minimal delay)?
> > > 2) Is there any guide/detailed documentation on these options?
> > > 3) Maybe there's some kind of common-sense formula that
> could help choose the values based on proxy load (requests per
> second)?
> >
> > proxy.config.allocator.thread_freelist_size seems like a reasonable
> setting to tune. This is going to control how much memory will be
> permanently allocated to per-thread magazines. It's a tuning balance
> between how much memory you need to allocate to servicing transactions and
> how much memory should be used for RAM cache and other processes that are
> running on the box.
> >
> > FWIW I run my systems with
> proxy.config.allocator.thread_freelist_size=16K, though my workload and
> hardware configuration is pretty different from yours.
> >
> > > Device under test is Ubuntu 12.04 LTS, 64-bit, 6GB RAM; expected load
> is up to 15k requests per second.
> > >
> > > TIA,
> > > Pavel
> > >
> > > On 10/17/2013 10:07 PM, Igor Galić wrote:
> > >>
> > >> ----- Original Message -----
> > >>> Thank you Igor.
> > >>>
> > >>> I've rebuilt ATS with hwloc and things became a bit better. Now I see
> > >>> that load is being balanced fairly across the configured number of threads
> > >> Thank you very much for this feedback.
> > >> I knew that these code paths have an impact (I've hacked bits of them too)
> > >> but since I always compile --with-hwloc, I never saw the difference.
> > >>
> > >>
> > >> ++i
> > >>
> > >> Igor Galić
> > >>
> > >> Tel: +43 (0) 664 886 22 883
> > >> Mail: i.galic@brainsware.org
> > >> URL: http://brainsware.org/
> > >> GPG: 6880 4155 74BD FD7C B515  2EA5 4B1D 9E08 A097 C9AE
> > >>
> > >
> >
> >
> >
> >
> > --
> > ____________________________________________________________
> > Adam W. Dace <co...@gmail.com>
> >
> > Phone: (815) 355-5848
> > Instant Messenger: AIM & Yahoo! IM - colonelforbin74 | ICQ - #39374451
> > Microsoft Messenger - colonelforbin74@live.com
> >
> > Google Profile: https://plus.google.com/u/0/109309036874332290399/about
>
>


-- 
____________________________________________________________
Adam W. Dace <co...@gmail.com>

Phone: (815) 355-5848
Instant Messenger: AIM & Yahoo! IM - colonelforbin74 | ICQ - #39374451
Microsoft Messenger - colonelforbin74@live.com <ad...@turing.com>

Google Profile: https://plus.google.com/u/0/109309036874332290399/about

Re: Tuning ATS for performance testing.

Posted by James Peach <jp...@apache.org>.
On Oct 22, 2013, at 1:45 PM, Adam W. Dace <co...@gmail.com> wrote:

> Just thought I'd jump in here and report that after seeing his email, just for fun, I tried increasing proxy.config.system.mmap_max
> from 2MB (default) to 8MB... then 16MB (my disk can read/write about 50MB/sec, so this seems reasonable).

That's really interesting. You could also test using tcmalloc, which has a reputation for higher performance than glibc's malloc.

> 
> All of a sudden my Bing Image Searches aren't running into a write lock contention issue anymore.
> I just may end up keeping this setting.  :)
> 
> 
> 
> On Tue, Oct 22, 2013 at 11:36 AM, James Peach <jp...@apache.org> wrote:
> 
> On Oct 22, 2013, at 6:17 AM, Pavel Kazlenka <pa...@measurement-factory.com> wrote:
> 
> > Thank you all for your replies. I have new questions here.
> >
> > I'm trying to estimate the performance of a single ATS 'network' (main) thread (I hope I'm using the correct term). My thread-related configuration:
> > CONFIG proxy.config.exec_thread.autoconfig INT 0
> > CONFIG proxy.config.exec_thread.autoconfig.scale FLOAT 1.000000
> > CONFIG proxy.config.exec_thread.limit INT 1
> > CONFIG proxy.config.accept_threads INT 1
> > CONFIG proxy.config.cache.threads_per_disk INT 0
> > CONFIG proxy.config.ssl.number.threads INT 0
> > CONFIG proxy.config.task_threads INT 1
> > #Caching is off:
> > CONFIG proxy.config.http.cache.http INT 0
> >
> > I've found that memory-related options have a great impact on performance:
> >
> > CONFIG proxy.config.thread.default.stacksize INT 536870912
> 
> 512MB stack size per thread? Wow, that seems like an awful lot.
> 
> > CONFIG proxy.config.allocator.thread_freelist_size INT 524288
> > CONFIG proxy.config.system.mmap_max INT 5368709120
> 
> This setting ends up calling mallopt(M_MMAP_MAX), see <http://man7.org/linux/man-pages/man3/mallopt.3.html>. From the description in the man page, I'd be surprised if setting this was useful in most cases.
> 
> >
> > My problem is that the default values for the variables given above are extremely conservative. On the other hand, it's hard to pick good values from such a large range, especially since these options are not well documented.
> > So my questions are:
> > 1) What would be good values for the three variables above on a machine with 6GB RAM (assuming the machine's only purpose is running a single ATS thread to forward traffic at a high rate with minimal delay)?
> > 2) Is there any guide/detailed documentation on these options?
> > 3) Maybe there's some kind of common-sense formula that could help choose the values based on proxy load (requests per second)?
> 
> proxy.config.allocator.thread_freelist_size seems like a reasonable setting to tune. This is going to control how much memory will be permanently allocated to per-thread magazines. It's a tuning balance between how much memory you need to allocate to servicing transactions and how much memory should be used for RAM cache and other processes that are running on the box.
> 
> FWIW I run my systems with proxy.config.allocator.thread_freelist_size=16K, though my workload and hardware configuration is pretty different from yours.
> 
> > Device under test is Ubuntu 12.04 LTS, 64-bit, 6GB RAM; expected load is up to 15k requests per second.
> >
> > TIA,
> > Pavel
> >
> > On 10/17/2013 10:07 PM, Igor Galić wrote:
> >>
> >> ----- Original Message -----
> >>> Thank you Igor.
> >>>
> >>> I've rebuilt ATS with hwloc and things became a bit better. Now I see
> >>> that load is being balanced fairly across the configured number of threads
> >> Thank you very much for this feedback.
> >> I knew that these code paths have an impact (I've hacked bits of them too)
> >> but since I always compile --with-hwloc, I never saw the difference.
> >>
> >>
> >> ++i
> >>
> >> Igor Galić
> >>
> >> Tel: +43 (0) 664 886 22 883
> >> Mail: i.galic@brainsware.org
> >> URL: http://brainsware.org/
> >> GPG: 6880 4155 74BD FD7C B515  2EA5 4B1D 9E08 A097 C9AE
> >>
> >
> 
> 
> 
> 
> -- 
> ____________________________________________________________
> Adam W. Dace <co...@gmail.com>
> 
> Phone: (815) 355-5848
> Instant Messenger: AIM & Yahoo! IM - colonelforbin74 | ICQ - #39374451
> Microsoft Messenger - colonelforbin74@live.com
> 
> Google Profile: https://plus.google.com/u/0/109309036874332290399/about


Re: Tuning ATS for performance testing.

Posted by "Adam W. Dace" <co...@gmail.com>.
Just thought I'd jump in here and report that after seeing his email, just
for fun, I tried increasing proxy.config.system.mmap_max
from 2MB (default) to 8MB... then 16MB (my disk can read/write about 50MB/sec,
so this seems reasonable).

All of a sudden my Bing Image Searches aren't running into a write lock
contention issue anymore.
I just may end up keeping this setting.  :)



On Tue, Oct 22, 2013 at 11:36 AM, James Peach <jp...@apache.org> wrote:

>
> On Oct 22, 2013, at 6:17 AM, Pavel Kazlenka <
> pavel.kazlenka@measurement-factory.com> wrote:
>
> > Thank you all for your replies. I have new questions here.
> >
> > I'm trying to estimate the performance of a single ATS 'network' (main) thread
> (I hope I'm using the correct term). My thread-related configuration:
> > CONFIG proxy.config.exec_thread.autoconfig INT 0
> > CONFIG proxy.config.exec_thread.autoconfig.scale FLOAT 1.000000
> > CONFIG proxy.config.exec_thread.limit INT 1
> > CONFIG proxy.config.accept_threads INT 1
> > CONFIG proxy.config.cache.threads_per_disk INT 0
> > CONFIG proxy.config.ssl.number.threads INT 0
> > CONFIG proxy.config.task_threads INT 1
> > #Caching is off:
> > CONFIG proxy.config.http.cache.http INT 0
> >
> > I've found that memory-related options have a great impact on
> performance:
> >
> > CONFIG proxy.config.thread.default.stacksize INT 536870912
>
> 512MB stack size per thread? Wow, that seems like an awful lot.
>
> > CONFIG proxy.config.allocator.thread_freelist_size INT 524288
> > CONFIG proxy.config.system.mmap_max INT 5368709120
>
> This setting ends up calling mallopt(M_MMAP_MAX), see <
> http://man7.org/linux/man-pages/man3/mallopt.3.html>. From the
> description in the man page, I'd be surprised if setting this was useful in
> most cases.
>
> >
> > My problem is that the default values for the variables given above are
> extremely conservative. On the other hand, it's hard to pick good
> values from such a large range, especially since these options are not
> well documented.
> > So my questions are:
> > 1) What would be good values for the three variables above on a machine
> with 6GB RAM (assuming the machine's only purpose is running a single ATS
> thread to forward traffic at a high rate with minimal delay)?
> > 2) Is there any guide/detailed documentation on these options?
> > 3) Maybe there's some kind of common-sense formula that could
> help choose the values based on proxy load (requests per second)?
>
> proxy.config.allocator.thread_freelist_size seems like a reasonable
> setting to tune. This is going to control how much memory will be
> permanently allocated to per-thread magazines. It's a tuning balance
> between how much memory you need to allocate to servicing transactions and
> how much memory should be used for RAM cache and other processes that are
> running on the box.
>
> FWIW I run my systems with
> proxy.config.allocator.thread_freelist_size=16K, though my workload and
> hardware configuration is pretty different from yours.
>
> > Device under test is Ubuntu 12.04 LTS, 64-bit, 6GB RAM; expected load is
> up to 15k requests per second.
> >
> > TIA,
> > Pavel
> >
> > On 10/17/2013 10:07 PM, Igor Galić wrote:
> >>
> >> ----- Original Message -----
> >>> Thank you Igor.
> >>>
> >>> I've rebuilt ATS with hwloc and things became a bit better. Now I see
> >>> that load is being balanced fairly across the configured number of threads
> >> Thank you very much for this feedback.
> >> I knew that these code paths have an impact (I've hacked bits of them too)
> >> but since I always compile --with-hwloc, I never saw the difference.
> >>
> >>
> >> ++i
> >>
> >> Igor Galić
> >>
> >> Tel: +43 (0) 664 886 22 883
> >> Mail: i.galic@brainsware.org
> >> URL: http://brainsware.org/
> >> GPG: 6880 4155 74BD FD7C B515  2EA5 4B1D 9E08 A097 C9AE
> >>
> >
>
>


-- 
____________________________________________________________
Adam W. Dace <co...@gmail.com>

Phone: (815) 355-5848
Instant Messenger: AIM & Yahoo! IM - colonelforbin74 | ICQ - #39374451
Microsoft Messenger - colonelforbin74@live.com <ad...@turing.com>

Google Profile: https://plus.google.com/u/0/109309036874332290399/about

Re: Tuning ATS for performance testing.

Posted by James Peach <jp...@apache.org>.
On Oct 22, 2013, at 6:17 AM, Pavel Kazlenka <pa...@measurement-factory.com> wrote:

> Thank you all for your replies. I have new questions here.
> 
> I'm trying to estimate the performance of a single ATS 'network' (main) thread (I hope I'm using the correct term). My thread-related configuration:
> CONFIG proxy.config.exec_thread.autoconfig INT 0
> CONFIG proxy.config.exec_thread.autoconfig.scale FLOAT 1.000000
> CONFIG proxy.config.exec_thread.limit INT 1
> CONFIG proxy.config.accept_threads INT 1
> CONFIG proxy.config.cache.threads_per_disk INT 0
> CONFIG proxy.config.ssl.number.threads INT 0
> CONFIG proxy.config.task_threads INT 1
> #Caching is off:
> CONFIG proxy.config.http.cache.http INT 0
> 
> I've found that memory-related options have a great impact on performance:
> 
> CONFIG proxy.config.thread.default.stacksize INT 536870912

512MB stack size per thread? Wow, that seems like an awful lot.

> CONFIG proxy.config.allocator.thread_freelist_size INT 524288
> CONFIG proxy.config.system.mmap_max INT 5368709120

This setting ends up calling mallopt(M_MMAP_MAX), see <http://man7.org/linux/man-pages/man3/mallopt.3.html>. From the description in the man page, I'd be surprised if setting this was useful in most cases.
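
Note, per the same man page, that the M_MMAP_MAX value is a count of 
mmap-serviced allocations, not a byte size. glibc also reads this knob 
from the environment, so a one-off experiment could look like this 
(assuming traffic_server is launched directly; 65536 is an arbitrary 
example count):

	MALLOC_MMAP_MAX_=65536 traffic_server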

> 
> My problem is that the default values for the variables given above are extremely conservative. On the other hand, it's hard to pick good values from such a large range, especially since these options are not well documented.
> So my questions are:
> 1) What would be good values for the three variables above on a machine with 6GB RAM (assuming the machine's only purpose is running a single ATS thread to forward traffic at a high rate with minimal delay)?
> 2) Is there any guide/detailed documentation on these options?
> 3) Maybe there's some kind of common-sense formula that could help choose the values based on proxy load (requests per second)?

proxy.config.allocator.thread_freelist_size seems like a reasonable setting to tune. This is going to control how much memory will be permanently allocated to per-thread magazines. It's a tuning balance between how much memory you need to allocate to servicing transactions and how much memory should be used for RAM cache and other processes that are running on the box.

FWIW I run my systems with proxy.config.allocator.thread_freelist_size=16K, though my workload and hardware configuration is pretty different from yours.
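
In records.config form that 16K would be something like the line below 
(16K = 16384 objects; same INT syntax as the allocator line quoted above):

	CONFIG proxy.config.allocator.thread_freelist_size INT 16384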

> Device under test is Ubuntu 12.04 LTS, 64-bit, 6GB RAM; expected load is up to 15k requests per second.
> 
> TIA,
> Pavel
> 
> On 10/17/2013 10:07 PM, Igor Galić wrote:
>> 
>> ----- Original Message -----
>>> Thank you Igor.
>>> 
>>> I've rebuilt ATS with hwloc and things became a bit better. Now I see
>>> that load is being balanced fairly across the configured number of threads
>> Thank you very much for this feedback.
>> I knew that these code paths have an impact (I've hacked bits of them too)
>> but since I always compile --with-hwloc, I never saw the difference.
>> 
>> 
>> ++i
>> 
>> Igor Galić
>> 
>> Tel: +43 (0) 664 886 22 883
>> Mail: i.galic@brainsware.org
>> URL: http://brainsware.org/
>> GPG: 6880 4155 74BD FD7C B515  2EA5 4B1D 9E08 A097 C9AE
>> 
> 


Re: Tuning ATS for performance testing.

Posted by Pavel Kazlenka <pa...@measurement-factory.com>.
Thank you all for your replies. I have new questions here.

I'm trying to estimate the performance of a single ATS 'network' (main) 
thread (I hope I'm using the correct term). My thread-related configuration:
CONFIG proxy.config.exec_thread.autoconfig INT 0
CONFIG proxy.config.exec_thread.autoconfig.scale FLOAT 1.000000
CONFIG proxy.config.exec_thread.limit INT 1
CONFIG proxy.config.accept_threads INT 1
CONFIG proxy.config.cache.threads_per_disk INT 0
CONFIG proxy.config.ssl.number.threads INT 0
CONFIG proxy.config.task_threads INT 1
#Caching is off:
CONFIG proxy.config.http.cache.http INT 0

I've found that memory-related options have a great impact on performance:

CONFIG proxy.config.thread.default.stacksize INT 536870912
CONFIG proxy.config.allocator.thread_freelist_size INT 524288
CONFIG proxy.config.system.mmap_max INT 5368709120

My problem is that the default values for the variables given above are 
extremely conservative. On the other hand, it's hard to pick good 
values from such a large range, especially since these options are not 
well documented.
So my questions are:
1) What would be good values for the three variables above on a machine 
with 6GB RAM (assuming the machine's only purpose is running a single 
ATS thread to forward traffic at a high rate with minimal delay)?
2) Is there any guide/detailed documentation on these options?
3) Maybe there's some kind of common-sense formula that could 
help choose the values based on proxy load (requests per second)?

Device under test is Ubuntu 12.04 LTS, 64-bit, 6GB RAM; expected load is 
up to 15k requests per second.

TIA,
Pavel

On 10/17/2013 10:07 PM, Igor Galić wrote:
>
> ----- Original Message -----
>> Thank you Igor.
>>
>> I've rebuilt ATS with hwloc and things became a bit better. Now I see
>> that load is being balanced fairly across the configured number of threads
> Thank you very much for this feedback.
> I knew that these code paths have an impact (I've hacked bits of them too)
> but since I always compile --with-hwloc, I never saw the difference.
>
>
> ++i
>
> Igor Galić
>
> Tel: +43 (0) 664 886 22 883
> Mail: i.galic@brainsware.org
> URL: http://brainsware.org/
> GPG: 6880 4155 74BD FD7C B515  2EA5 4B1D 9E08 A097 C9AE
>


Re: Tuning ATS for performance testing.

Posted by Igor Galić <i....@brainsware.org>.

----- Original Message -----
> Thank you Igor.
> 
> I've rebuilt ATS with hwloc and things became a bit better. Now I see
> that load is being balanced fairly across the configured number of threads

Thank you very much for this feedback.
I knew that these code paths have an impact (I've hacked bits of them too)
but since I always compile --with-hwloc, I never saw the difference.


++i

Igor Galić

Tel: +43 (0) 664 886 22 883
Mail: i.galic@brainsware.org
URL: http://brainsware.org/
GPG: 6880 4155 74BD FD7C B515  2EA5 4B1D 9E08 A097 C9AE


Re: Tuning ATS for performance testing.

Posted by Leif Hedstrom <zw...@apache.org>.
On Oct 17, 2013, at 11:15 AM, Pavel Kazlenka <pa...@measurement-factory.com> wrote:

> Thank you Igor.
> 
> I've rebuilt ATS with hwloc and things became a bit better. Now I see that load is being balanced fairly between configured number of threads during the test. If I understood correctly, this number is either exec_thread.autoconfig.scale * number of cpu cores if exec_thread.autoconfig is set to '1' or exec_thread.limit in opposite case.
> 
> Also I found that there should be about 30 threads to keep load of 4-5k requests per second. But there are still many CPU cycles left (threads take about 80-90% CPU of 400% available).
> One of the possible problems I observe is that threads are migrating between CPU cores. Each time such a migration occurs, thread receives performance penalty that is not good. Is there any mechanism to bind thread to cpu core at startup?

CONFIG proxy.config.exec_thread.affinity INT 1
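
To verify the pinning took effect, something like this should print one 
affinity mask per thread (assuming Linux procfs and util-linux taskset):

	for tid in /proc/$(pidof traffic_server)/task/*; do
	    taskset -p $(basename $tid)
	done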

-- leif


Re: Tuning ATS for performance testing.

Posted by Pavel Kazlenka <pa...@measurement-factory.com>.
Thank you Igor.

I've rebuilt ATS with hwloc and things became a bit better. Now I see 
that load is being balanced fairly across the configured number of 
threads during the test. If I understood correctly, this number is 
either exec_thread.autoconfig.scale * the number of CPU cores if 
exec_thread.autoconfig is set to '1', or exec_thread.limit otherwise.
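
For example, with autoconfig on and a hypothetical scale of 1.5 on this 
quad-core box, that would be 1.5 * 4 = 6 net threads. A rough live 
check, assuming the [ET_NET *] thread names are visible in procfs the 
same way top shows them:

	grep -l ET_NET /proc/$(pidof traffic_server)/task/*/comm | wc -l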

Also, I found that about 30 threads are needed to sustain a load of 4-5k 
requests per second. But there are still many CPU cycles left (the 
threads take about 80-90% CPU of the 400% available).
One possible problem I observe is that threads migrate between CPU 
cores. Each migration imposes a performance penalty, which is not good. 
Is there any mechanism to bind a thread to a CPU core at startup?

Also, maybe there are other useful libraries ATS should be built 
against for better performance? My config.log can be found here: 
http://pastebin.com/u5sV61rR

TIA,
Pavel

On 10/17/2013 04:22 PM, Igor Galić wrote:
>
> ----- Original Message -----
>> On 10/17/2013 12:30 AM, Igor Galić wrote:
>>> ----- Original Message -----
>>>> Hi gentlemen,
>>> Hi Pavel,
>>>    
>>>> I'm trying to test the performance of ATS v.4.0.2.
>>>>
>>>> Server under test has a quad-core CPU with HT disabled. During the test (1k
>>> Given the range of platforms we support, it's /always/ good to explicitly
>>> state which platform (OS, version, kernel) you're running on.
>>>
>>> But also, exactly how you compiled it.
>> It's Ubuntu 12.04 LTS (32-bit). ATS is configured with default options
>> (except for --prefix).
> there is no such thing as a default when everything is being discovered ;)
> Are you compiling with hwloc? (And if not, can you try it, and report
> how it changes the behaviour?)
>
>>>> user-agents, 1k origin servers, up to 6k requests per second with an
>>>> average size of 8 KB), at the 2-2.5k requests per second mark I see
>>> Given the range of configurations we support, it's always good to explicitly
>>> state if this is a forward, reverse or transparent proxy (You only mention
>>> later that caching is fully disabled..)
>> Right. This is a forward proxy case with reverse proxy mode explicitly
>> disabled.
>>>> signs of overload (growing delay time, missed responses). The problem
>>>> is that according to top output, the CPU is not under heavy load
>>>> (which is strange for an overloaded system). All the other resources (RAM,
>>>> I/O, network) are far from saturation too. Top shows load at about
>>>> 50-60% of one core for the [ET_NET 0] process. traffic_server threads seem
>>>> to be spread across all the cores, even when I try to pin them
>>>> to one or two of the cores using taskset.
>>> (at this point I can now guess with certainty that you're talking about
>>> Linux, but I still don't know which distro/version, etc..)
>>>    
>>>> My alterations to the default ATS configuration (mostly following this
>>>> guide: http://www.ogre.com/node/392):
>>>>
>>>> Cache is fully disabled:
>>>> CONFIG proxy.config.http.cache.http INT 0
>>>> Threads:
>>>> CONFIG proxy.config.exec_thread.autoconfig INT 0
>>>> CONFIG proxy.config.exec_thread.autoconfig.scale FLOAT 1
>>>> CONFIG proxy.config.exec_thread.limit INT 4
>>>> CONFIG proxy.config.accept_threads INT 2
>>>> CONFIG proxy.config.cache.threads_per_disk INT 1
>>>> CONFIG proxy.config.task_threads INT 4
>>>>
>>>> So my questions are:
>>>> 1) Is there any known strategy for distributing ATS processes/threads
>>>> across CPU cores? E.g., bind all traffic_server threads to cpu0 and cpu1,
>>>> all traffic_manager threads to cpu2, and networking interrupts to cpu3?
>>>> 2) If so, how can this be done? I see some threads ignore 'taskset -a -p
>>>> 1,2 <traffic_server pid>' and run on any CPU core. Perhaps via
>>>> configuration directives?
>>>> 3) What is the best strategy for core configuration? Should the sum of
>>>> task, accept, and network threads equal the number of CPU cores + 1? Or
>>>> something else? Maybe it's better to use 40 threads in total on a
>>>> quad-core device?
>>>> 4) Are the *thread* config options taken into account if
>>>> proxy.config.http.cache.http is set to '1'?
>> Here I copied the wrong option. I meant
>> 'proxy.config.exec_thread.autoconfig' set to '1'.
>>>> 5) What other options influence system performance in
>>>> a cache-off test?
>>>>
>>>> TIA,
>>>> Pavel
>>>>
>>>>
>>>>
>>


Re: Tuning ATS for performance testing.

Posted by Igor Galić <i....@brainsware.org>.

----- Original Message -----
> On 10/17/2013 12:30 AM, Igor Galić wrote:
> >
> > ----- Original Message -----
> >> Hi gentlemen,
> > Hi Pavel,
> >   
> >> I'm trying to test the performance of ATS v.4.0.2.
> >>
> >> Server under test has a quad-core CPU with HT disabled. During the test (1k
> > Given the range of platforms we support, it's /always/ good to explicitly
> > state which platform (OS, version, kernel) you're running on.
> >
> > But also, exactly how you compiled it.
> 
> It's Ubuntu 12.04 LTS (32-bit). ATS is configured with default options
> (except for --prefix).

there is no such thing as a default when everything is being discovered ;)
Are you compiling with hwloc? (And if not, can you try it, and report
how it changes the behaviour?)
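
A minimal sketch of such a build, assuming the hwloc development 
headers are installed and using a placeholder prefix:

	./configure --prefix=/opt/ats --with-hwloc && make && make install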

> >
> >> user-agents, 1k origin servers, up to 6k requests per second with an
> >> average size of 8 KB), at the 2-2.5k requests per second mark I see
> > Given the range of configurations we support, it's always good to explicitly
> > state if this is a forward, reverse or transparent proxy (You only mention
> > later that caching is fully disabled..)
> Right. This is a forward proxy case with reverse proxy mode explicitly
> disabled.
> >
> >> signs of overload (growing delay time, missed responses). The problem
> >> is that according to top output, the CPU is not under heavy load
> >> (which is strange for an overloaded system). All the other resources (RAM,
> >> I/O, network) are far from saturation too. Top shows load at about
> >> 50-60% of one core for the [ET_NET 0] process. traffic_server threads seem
> >> to be spread across all the cores, even when I try to pin them
> >> to one or two of the cores using taskset.
> > (at this point I can now guess with certainty that you're talking about
> > Linux, but I still don't know which distro/version, etc..)
> >   
> >> My alterations to the default ATS configuration (mostly following this
> >> guide: http://www.ogre.com/node/392):
> >>
> >> Cache is fully disabled:
> >> CONFIG proxy.config.http.cache.http INT 0
> >> Threads:
> >> CONFIG proxy.config.exec_thread.autoconfig INT 0
> >> CONFIG proxy.config.exec_thread.autoconfig.scale FLOAT 1
> >> CONFIG proxy.config.exec_thread.limit INT 4
> >> CONFIG proxy.config.accept_threads INT 2
> >> CONFIG proxy.config.cache.threads_per_disk INT 1
> >> CONFIG proxy.config.task_threads INT 4
> >>
> >> So my questions are:
> >> 1) Is there any known strategy for distributing ATS processes/threads
> >> across CPU cores? E.g., bind all traffic_server threads to cpu0 and cpu1,
> >> all traffic_manager threads to cpu2, and networking interrupts to cpu3?
> >> 2) If so, how can this be done? I see some threads ignore 'taskset -a -p
> >> 1,2 <traffic_server pid>' and run on any CPU core. Perhaps via
> >> configuration directives?
> >> 3) What is the best strategy for core configuration? Should the sum of
> >> task, accept, and network threads equal the number of CPU cores + 1? Or
> >> something else? Maybe it's better to use 40 threads in total on a
> >> quad-core device?
> >> 4) Are the *thread* config options taken into account if
> >> proxy.config.http.cache.http is set to '1'?
> Here I copied the wrong option. I meant
> 'proxy.config.exec_thread.autoconfig' set to '1'.
> >> 5) What other options influence system performance in
> >> a cache-off test?
> >>
> >> TIA,
> >> Pavel
> >>
> >>
> >>
> 
> 

-- 
Igor Galić

Tel: +43 (0) 664 886 22 883
Mail: i.galic@brainsware.org
URL: http://brainsware.org/
GPG: 6880 4155 74BD FD7C B515  2EA5 4B1D 9E08 A097 C9AE


Re: Tuning ATS for performance testing.

Posted by Pavel Kazlenka <pa...@measurement-factory.com>.
On 10/17/2013 12:30 AM, Igor Galić wrote:
>
> ----- Original Message -----
>> Hi gentlemen,
> Hi Pavel,
>   
>> I'm trying to test the performance of ATS v.4.0.2.
>>
>> Server under test has a quad-core CPU with HT disabled. During the test (1k
> Given the range of platforms we support, it's /always/ good to explicitly
> state which platform (OS, version, kernel) you're running on.
>
> But also, exactly how you compiled it.

It's Ubuntu 12.04 LTS (32-bit). ATS is configured with default options 
(except for --prefix).
>
>> user-agents, 1k origin servers, up to 6k requests per second with an
>> average size of 8 KB), at the 2-2.5k requests per second mark I see
> Given the range of configurations we support, it's always good to explicitly
> state if this is a forward, reverse or transparent proxy (You only mention
> later that caching is fully disabled..)
Right. This is a forward proxy case with reverse proxy mode explicitly 
disabled.
>
>> signs of overload (growing delay time, missed responses). The problem
>> is that according to top output, the CPU is not under heavy load
>> (which is strange for an overloaded system). All the other resources (RAM,
>> I/O, network) are far from saturation too. Top shows load at about
>> 50-60% of one core for the [ET_NET 0] process. traffic_server threads seem
>> to be spread across all the cores, even when I try to pin them
>> to one or two of the cores using taskset.
> (at this point I can now guess with certainty that you're talking about
> Linux, but I still don't know which distro/version, etc..)
>   
>> My alterations to the default ATS configuration (mostly following this
>> guide: http://www.ogre.com/node/392):
>>
>> Cache is fully disabled:
>> CONFIG proxy.config.http.cache.http INT 0
>> Threads:
>> CONFIG proxy.config.exec_thread.autoconfig INT 0
>> CONFIG proxy.config.exec_thread.autoconfig.scale FLOAT 1
>> CONFIG proxy.config.exec_thread.limit INT 4
>> CONFIG proxy.config.accept_threads INT 2
>> CONFIG proxy.config.cache.threads_per_disk INT 1
>> CONFIG proxy.config.task_threads INT 4
>>
>> So my questions are:
>> 1) Is there any known strategy for distributing ATS processes/threads
>> across CPU cores? E.g., bind all traffic_server threads to cpu0 and cpu1,
>> all traffic_manager threads to cpu2, and networking interrupts to cpu3?
>> 2) If so, how can this be done? I see some threads ignore 'taskset -a -p
>> 1,2 <traffic_server pid>' and run on any CPU core. Perhaps via
>> configuration directives?
>> 3) What is the best strategy for core configuration? Should the sum of
>> task, accept, and network threads equal the number of CPU cores + 1? Or
>> something else? Maybe it's better to use 40 threads in total on a
>> quad-core device?
>> 4) Are the *thread* config options taken into account if
>> proxy.config.http.cache.http is set to '1'?
Here I copied the wrong option. I meant 
'proxy.config.exec_thread.autoconfig' set to '1'.
>> 5) What other options influence system performance in
>> a cache-off test?
>>
>> TIA,
>> Pavel
>>
>>
>>


Re: Tuning ATS for performance testing.

Posted by Igor Galić <i....@brainsware.org>.

----- Original Message -----
> Hi gentlemen,

Hi Pavel,
 
> I'm trying to test the performance of ATS v.4.0.2.
> 
> Server under test has a quad-core CPU with HT disabled. During the test (1k

Given the range of platforms we support, it's /always/ good to explicitly
state which platform (OS, version, kernel) you're running on.

But also, exactly how you compiled it.

> user-agents, 1k origin servers, up to 6k requests per second with an
> average size of 8 KB), at the 2-2.5k requests per second mark I see

Given the range of configurations we support, it's always good to explicitly
state if this is a forward, reverse or transparent proxy (You only mention
later that caching is fully disabled..)

> signs of overload (growing delay time, missed responses). The problem
> is that according to top output, the CPU is not under heavy load
> (which is strange for an overloaded system). All the other resources (RAM,
> I/O, network) are far from saturation too. Top shows load at about
> 50-60% of one core for the [ET_NET 0] process. traffic_server threads seem
> to be spread across all the cores, even when I try to pin them
> to one or two of the cores using taskset.

(at this point I can now guess with certainty that you're talking about
Linux, but I still don't know which distro/version, etc..)
 
> My alterations to the default ATS configuration (mostly following this
> guide: http://www.ogre.com/node/392):
> 
> Cache is fully disabled:
> CONFIG proxy.config.http.cache.http INT 0
> Threads:
> CONFIG proxy.config.exec_thread.autoconfig INT 0
> CONFIG proxy.config.exec_thread.autoconfig.scale FLOAT 1
> CONFIG proxy.config.exec_thread.limit INT 4
> CONFIG proxy.config.accept_threads INT 2
> CONFIG proxy.config.cache.threads_per_disk INT 1
> CONFIG proxy.config.task_threads INT 4
> 
> So my questions are:
> 1) Is there any known strategy for distributing ATS processes/threads
> across CPU cores? E.g., bind all traffic_server threads to cpu0 and cpu1,
> all traffic_manager threads to cpu2, and networking interrupts to cpu3?
> 2) If so, how can this be done? I see some threads ignore 'taskset -a -p
> 1,2 <traffic_server pid>' and run on any CPU core. Perhaps via
> configuration directives?
> 3) What is the best strategy for core configuration? Should the sum of
> task, accept, and network threads equal the number of CPU cores + 1? Or
> something else? Maybe it's better to use 40 threads in total on a
> quad-core device?
> 4) Are the *thread* config options taken into account if
> proxy.config.http.cache.http is set to '1'?
> 5) What other options influence system performance in a cache-off test?
> 
> TIA,
> Pavel
> 
> 
> 

-- 
Igor Galić

Tel: +43 (0) 664 886 22 883
Mail: i.galic@brainsware.org
URL: http://brainsware.org/
GPG: 6880 4155 74BD FD7C B515  2EA5 4B1D 9E08 A097 C9AE