Posted to users@qpid.apache.org by ft420 <ar...@gmail.com> on 2009/06/03 08:14:24 UTC
worker thread with qpidd
hi,
Without the --worker-threads option, pidstat shows that 6 threads are
created by default.
With --worker-threads 6, pidstat shows that there are 9 threads, i.e. the
default 6 + 3.
As per the documentation, the worker threads option is used to improve performance.
I tested with --worker-threads 10 and without --worker-threads.
direct_producer sends 100000 messages, and the put time increases with
--worker-threads 10 compared to running without the option.
How exactly can performance be improved in qpid using threads?
Thanks
--
View this message in context: http://n2.nabble.com/worker-thread-with-qpidd-tp3016587p3016587.html
Sent from the Apache Qpid users mailing list archive at Nabble.com.
---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project: http://qpid.apache.org
Use/Interact: mailto:users-subscribe@qpid.apache.org
Re: worker thread with qpidd
Posted by Carl Trieloff <cc...@redhat.com>.
Generally, if you set the number of threads larger than the core count,
your performance will go down, as expected. However, the reason the
option is there is so that if you pin the process to fewer than the
number of cores, the thread count can be adjusted to match.
On a 2 core machine with the client on the same machine, there are not that
many options, as the client and broker will contend for the resources
on the machine.
My employer did a report with HP. It is too big to mail out to the
list, but here is some of the basic setup that was done for it.
regards
Carl.
Throughput (Perftest)
For throughput, perftest is used to drive the broker for this benchmark.
This harness is able to start up multiple producers and consumers in
balanced (n:n) or unbalanced configurations (x:y).
What the test does:
* creates a control queue
* starts x:y producers and consumers
* waits for all processors to signal they are ready
* controller records a timestamp
* producers reliably enqueue messages onto the broker as fast as they can
* consumers reliably dequeue messages from the broker as fast as they can
* once the last message, which is marked, is received, the controller is
  signaled
* controller waits for all complete signals, records a timestamp, and
  calculates the rate
The throughput is then calculated as the total number of messages
reliably transferred divided by the time to transfer those messages.
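As a rough illustration of that last step, the calculation can be sketched as below. This is not perftest's own code; the function name and the sample numbers are made up for the example.

```python
def throughput(messages_transferred, start_ts, end_ts):
    """Messages per second between the controller's two timestamps (in seconds)."""
    elapsed = end_ts - start_ts
    if elapsed <= 0:
        raise ValueError("end timestamp must be later than start timestamp")
    return messages_transferred / elapsed


# Hypothetical numbers, for illustration only.
rate = throughput(messages_transferred=200000, start_ts=10.0, end_ts=12.5)
print(f"{rate:.0f} msgs/sec")
```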
Latency (Latencytest)
For latency, latencytest is used to drive the broker for this benchmark.
This harness is able to produce messages at a specified rate, or a
specified number of messages, that are timestamped, sent to the broker,
and looped back to the client node. The client will report the minimum,
maximum, and average time for a reporting interval when a rate is used,
or for all the messages sent when a count is used.
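The reported figures amount to a min/max/average over the round-trip times. A minimal sketch, assuming the send and receive timestamps are available as two lists of seconds (the function name and sample values are hypothetical, not taken from latencytest):

```python
def latency_summary(send_times, receive_times):
    """Return (min, max, average) round-trip time for matched timestamps."""
    if not send_times or len(send_times) != len(receive_times):
        raise ValueError("need equally sized, non-empty timestamp lists")
    round_trips = [recv - sent for sent, recv in zip(send_times, receive_times)]
    return min(round_trips), max(round_trips), sum(round_trips) / len(round_trips)


# Hypothetical timestamps, for illustration only.
lo, hi, avg = latency_summary([0.0, 1.0, 2.0], [0.5, 1.25, 3.0])
print(f"min={lo:.3f}s max={hi:.3f}s avg={avg:.3f}s")
```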
Tuning & Parameter Settings
For the testing in this paper the systems were not used for any other
purposes. Therefore, the configuration and tuning that is detailed
should be reviewed when other applications run alongside MRG Messaging.
Processes
For the testing performed the following were disabled (unless specified
otherwise):
SELinux
cpuspeed
irqbalance
haldaemon
yum-updatesd
smartd
setroubleshoot
sendmail
rpcgssd
rpcidmapd
rpcsvcgssd
rhnsd
pcscd
mdmonitor
mcstrans
kdump
isdn
iptables
ip6tables
hplip
hidd
gpm
cups
bluetooth
avahi-daemon
restorecond
auditd
SysCtl
The following kernel parameters were added to /etc/sysctl.conf.

net.ipv4.conf.default.arp_filter, net.ipv4.conf.all.arp_filter
    Value: 1
    Only respond to ARP requests on matching interface.

net.core.rmem_max, net.core.wmem_max
    Value: 8388608
    Maximum receive/send socket buffer size in bytes.

net.core.rmem_default, net.core.wmem_default
    Value: 262144
    Default setting of the socket receive/send buffer in bytes.

net.ipv4.tcp_rmem, net.ipv4.tcp_wmem
    Value: 65536 4194304 8388608
    Vector of 3 integers: min, default, max
        min - minimal size of receive/send buffer used by TCP sockets
        default - default size of receive/send buffer used by TCP sockets
        max - maximal size of receive/send buffer allowed for automatically
              selected receiver buffers for TCP socket

net.core.netdev_max_backlog
    Value: 10000
    Maximum number of packets, queued on the input side, when the interface
    receives packets faster than kernel can process them. Applies to
    non-NAPI devices only.

net.ipv4.tcp_window_scaling
    Value: 0
    Enable window scaling as defined in RFC1323.

net.ipv4.tcp_mem
    Value: 262144 4194304 8388608
    Vector of 3 integers: low, pressure, high
        low - below this number of pages TCP is not bothered about its
              memory appetite.
        pressure - when amount of memory allocated by TCP exceeds this
              number of pages, TCP moderates its memory consumption and enters
              memory pressure mode, which is exited when memory consumption
              falls under "low".
        high - number of pages allowed for queueing by all TCP sockets.

/*Table 1*/
ethtool
Some of the options ethtool allows the operator to change relate to
coalesce and offload settings. However, during experimentation only
changing the ring settings had a noticeable effect for throughput testing.
# ethtool -g eth1
Ring parameters for eth1:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 256
RX Mini: 0
RX Jumbo: 0
TX: 256
# ethtool -G eth1 rx 2048 tx 2048
# ethtool -g eth1
Ring parameters for eth1:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 2048
RX Mini: 0
RX Jumbo: 0
TX: 2048
#
ifconfig
ifconfig was used to increase the /maximum transfer unit/ (MTU) to
support jumbo frames and to increase /txqueuelen/ for throughput testing
when these changes had a noticeable effect.
# ifconfig eth1
eth1 Link encap:Ethernet HWaddr 00:18:71:EC:02:80
inet addr:192.168.15.96 Bcast:192.168.15.255 Mask:255.255.255.0
inet6 addr: fe80::218:71ff:feec:280/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:8 errors:0 dropped:0 overruns:0 frame:0
TX packets:9 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:480 (480.0 b) TX bytes:594 (594.0 b)
Memory:fdee0000-fdf00000
# ifconfig eth1 mtu 9000 txqueuelen 2000
# ifconfig eth1
eth1 Link encap:Ethernet HWaddr 00:18:71:EC:02:80
inet addr:192.168.15.96 Bcast:192.168.15.255 Mask:255.255.255.0
inet6 addr: fe80::218:71ff:feec:280/64 Scope:Link
UP BROADCAST MULTICAST MTU:9000 Metric:1
RX packets:8 errors:0 dropped:0 overruns:0 frame:0
TX packets:9 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:2000
RX bytes:480 (480.0 b) TX bytes:594 (594.0 b)
Memory:fdee0000-fdf00000
#
CPU affinity
For latency testing, all interrupts from the cores of one CPU socket
were reassigned to other cores. The interrupts for the interconnect
under test were assigned to cores of this vacated socket. The processes
related to the interconnect (e.g. ib_mad, ipoib) were then scheduled to
run on the vacated cores. The Qpid daemon was also scheduled to run on
these or a subset of the vacated cores. How latencytest was scheduled
was determined by the results of experiments limiting or not limiting
the latencytest test process to certain cores.
Experiments with perftest show that usually the best performance was
achieved with the affinity settings left as they were after boot,
without manipulation.
Interrupts can be directed to be handled by specific cores. /proc/interrupts
can be queried to identify the interrupts for devices and the number of
times each CPU/core has handled each interrupt. For each interrupt, a
file named /proc/irq/<IRQ #>/smp_affinity contains a hexadecimal mask
which controls which cores can respond to that interrupt. The
contents of these files can be queried or set.
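For illustration, the hexadecimal mask is just a bit field with one bit per core. The helper below is a sketch with a made-up name; it only computes the string and does not write to /proc (which would need root). On machines with more than 32 cores the kernel shows the mask in comma-separated 32-bit groups, which this sketch does not produce.

```python
def smp_affinity_mask(cores):
    """Build the hex mask string for a list of zero-based core numbers."""
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError("core numbers must be non-negative")
        mask |= 1 << core  # bit N set means core N may handle the interrupt
    return format(mask, "x")


# Cores 4-7 here stand in for the cores of a vacated socket (hypothetical layout).
print(smp_affinity_mask([4, 5, 6, 7]))
```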
Processes can be restricted to run on a set of CPUs/cores. taskset can
be used to define the list of CPUs/cores that a process can be scheduled
to execute on.
The MRG Realtime product includes an application, tuna, that allows
for easy setting of affinity of interrupts and processes, through a GUI
or command line.
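As a sketch of how pinning and the thread count fit together, the snippet below builds a taskset command line for qpidd. The helper name is made up, and choosing one more worker thread than pinned cores is an assumption that simply mirrors the broker's default rule; it is not a setting taken from the report.

```python
def pinned_qpidd_command(cores):
    """Build an argument list that pins qpidd to the given cores via taskset -c."""
    if not cores:
        raise ValueError("at least one core is required")
    cpu_list = ",".join(str(core) for core in sorted(cores))
    # Assumption: size the worker pool like the default (cores + 1),
    # but counted against the pinned cores rather than the whole machine.
    worker_threads = len(cores) + 1
    return ["taskset", "-c", cpu_list,
            "qpidd", "--worker-threads", str(worker_threads)]


print(" ".join(pinned_qpidd_command([2, 3])))
```

The resulting list can be passed to subprocess.run() on a machine where taskset and qpidd are installed.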
AMQP parameters
Qpid parameters can be specified on the command line, through
environment variables or through the Qpid configuration file.
The tests were run with the following qpidd options:
--auth no
    turn off connection authentication; makes setting up the test
    environment easier
--mgmt-enable no
    disable the collection of management data
--tcp-nodelay
    disable the batching of packets
--worker-threads <#>
    set the number of IO worker threads to <#>
    This was only used for the latency test, where the range used was
    between 1 and one more than the number of cores in a socket.
    The default, which was used for throughput, is one more than the total
    number of active cores.
/*Table 2*/
*Table 3* details the options which were specified for /perftest/.
For all testing in this paper a /count/ of 200000 was used.
Experimentation was used to detect if setting /tcp-nodelay/ was
beneficial or not. For each /size/ reported, the /npubs/ and
/nsubs/ were set equally from 1 to 8 by powers of 2 while /qt/ was
set from 1 to 16, also by powers of 2. The highest value for
each /size/ is reported.
--npubs <#>, --nsubs <#>
    number of publishers/subscribers per client
--count <#>
    number of messages sent per pub per qt,
    so total messages = count * qt * (npub+nsub)
--qt <#>
    number of queues being used
--size <#>
    message size
--tcp-nodelay
    disable the batching of packets
--protocol <tcp|rdma>
    used to specify RDMA, default is TCP
/*Table 3*/
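The sweep described above (equal npubs/nsubs from 1 to 8, qt from 1 to 16, both by powers of 2, count fixed at 200000) can be enumerated with a short script. This is only a sketch of the parameter grid using the options from Table 3; the function name and the example message size are made up, and it does not run perftest.

```python
from itertools import product

COUNT = 200000                    # count used for all testing in the paper
CLIENT_COUNTS = [1, 2, 4, 8]      # npubs and nsubs, set equally
QUEUE_COUNTS = [1, 2, 4, 8, 16]   # qt


def perftest_commands(size, tcp_nodelay=False):
    """Return one perftest argument list per (npubs/nsubs, qt) combination."""
    commands = []
    for clients, queues in product(CLIENT_COUNTS, QUEUE_COUNTS):
        args = ["perftest",
                "--npubs", str(clients), "--nsubs", str(clients),
                "--count", str(COUNT), "--qt", str(queues),
                "--size", str(size)]
        if tcp_nodelay:
            args.append("--tcp-nodelay")
        commands.append(args)
    return commands


# 256 is a hypothetical message size, for illustration only.
for command in perftest_commands(size=256):
    print(" ".join(command))
```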
The parameters that were used for /latencytest/ are listed in *Table 4*.
A 10000 message /rate/ was chosen since all the test interconnects would
be able to maintain this rate. When specified, the /max-frame-size/ was
set to 120 more than the size. When a /max-frame-size/ was specified,
/bounds-multiplier/ was set to 1.
--rate <#>
    target message rate
--size <#>
    message size
--max-frame-size <#>
    the maximum frame size to request
    only specified for ethernet interconnects
--bounds-multiplier <#>
    bound size of write queue (as a multiple of the max frame size)
    only specified for ethernet interconnects
--tcp-nodelay
    disable the batching of packets
--protocol <tcp|rdma>
    used to specify RDMA, default is TCP
/*Table 4 */
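The relationship between the latencytest options can be sketched as follows: a rate of 10000, and for ethernet interconnects a max frame size of 120 more than the message size with a bounds multiplier of 1. The helper name and the sample size are made up for the example.

```python
def latencytest_args(size, rate=10000, ethernet=False):
    """Build a latencytest argument list following the settings described above."""
    args = ["latencytest", "--rate", str(rate), "--size", str(size)]
    if ethernet:
        # Frame size is 120 more than the message size; write queue bound is 1 frame.
        args += ["--max-frame-size", str(size + 120), "--bounds-multiplier", "1"]
    return args


# 1024 is a hypothetical message size, for illustration only.
print(" ".join(latencytest_args(1024, ethernet=True)))
```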
Re: worker thread with qpidd
Posted by Gordon Sim <gs...@redhat.com>.
ft420 wrote:
> exchange used: fanout
> we are running broker on 2 core machine. fanout send client is also running
> on the same windows machine.
> there are 3 recv applications running on three separate machines.
>
> we were trying with-> --worker-thread 9 which gives poor performance
> compared to without --worker-threads option
> now we have taken --worker-threads 2 as no of processors on the machine
> where broker is running is 2. in this case how many threads exactly has to
> be used to so as to improve performance
Best advice would be to run some experiments. My guess is 2 or 3
(default for 2 core machine) worker threads are likely to give the best
performance.
Re: worker thread with qpidd
Posted by ft420 <ar...@gmail.com>.
exchange used: fanout
We are running the broker on a 2 core machine. The fanout send client is
also running on the same Windows machine.
There are 3 recv applications running on three separate machines.
We tried --worker-threads 9, which gives poor performance
compared to running without the --worker-threads option.
Now we have set --worker-threads 2, since the machine where the broker
is running has 2 processors. In this case, exactly how many threads
should be used to improve performance?
Thanks
Re: worker thread with qpidd
Posted by Gordon Sim <gs...@redhat.com>.
ft420 wrote:
> hi,
>
> without --worker-thread option pidstat command shows that there are by
> default 6 threads created
> with --worker-thread 6 option pidstat command shows that there are 9 i.e.
> default 6 + 3 threads created.
Fyi: the extra three threads are timer threads for various different tasks.
> As per documentation worker threads option is used to improve performance.
> I checked with --worker-thread 10 and without --worker-thread.
> direct_producer sends 100000 messages put time increases with
> --worker-thread 10 as compared to --worler-thread option.
Running more threads than there are processors will not improve any real
parallelism. There is also no real value from using more threads than
you have active connections (so in a test with just one producer and one
consumer connection you won't see any benefit from having more than 2
worker threads).
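To make those two limits concrete, here is a small sketch that lists the --worker-threads values worth trying in an experiment. It is a rule of thumb built from the points in this thread (the default of cores + 1, and no benefit beyond the number of active connections), not something qpidd does itself, and the function name is made up.

```python
def worker_thread_candidates(cores, active_connections):
    """List --worker-threads values worth benchmarking, smallest first."""
    if cores < 1 or active_connections < 1:
        raise ValueError("cores and active_connections must be at least 1")
    default = cores + 1                      # broker default: one more than the cores
    upper = min(default, active_connections)  # extra threads beyond connections sit idle
    return list(range(1, upper + 1))


# 2 core broker with one fanout sender and three receivers connected.
print(worker_thread_candidates(cores=2, active_connections=4))
```

Each candidate still has to be measured; the list only narrows the range to test.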