Posted to proton@qpid.apache.org by Michael Goulish <mg...@redhat.com> on 2014/09/03 09:51:59 UTC

Proton Performance Pictures (1 of 2)

[ resend: I am attaching only 1 image here, so hopefully the Apache
mail gadget will not become upset.  Next one in the next email. ]



Attached, please find two cool pictures of the valgrind/callgrind data
I got with a test run of the psend and precv clients I mentioned before.


( Sorry, I keep saying 'clients'.  These are pure Peer-to-Peer. )
( Hey -- if we ever sell this technology to maritime transport companies,
could we call it "Pier-to-Pier" ? )



This was from a run of 100,000 messages, using a credit strategy of
200, 100, 100:
i.e. start at 200, and every time you get down to 100, add another 100.

That point is where I seem to find the best performance on my
system: 123,500 messages per second received.  ( i.e. 247,000
transfers per second ) using about 180% CPU ( i.e. 90% each of
2 processors. )
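
For reference, the 200 / 100 / 100 credit strategy is nothing fancier
than this against the Proton C engine API (pn_link_flow and pn_link_credit
are real engine calls; the helper names and constants are mine, and the
surrounding event loop is left out):

    #include <proton/link.h>

    /* Grant 200 credits up front; whenever outstanding credit falls to
       100 or below, grant another 100.  "receiver" is assumed to be an
       already-attached receiving pn_link_t. */
    #define INITIAL_CREDIT 200
    #define LOW_WATER      100
    #define REFILL         100

    void grant_initial_credit(pn_link_t *receiver) {
        pn_link_flow(receiver, INITIAL_CREDIT);
    }

    /* Call after each delivery has been processed. */
    void maintain_credit(pn_link_t *receiver) {
        if (pn_link_credit(receiver) <= LOW_WATER) {
            pn_link_flow(receiver, REFILL);
        }
    }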

By the way, I actually got repeatably better performance (maybe
1.5% better; that is where the 123,500 number comes from) by using processors
1 and 3 on my laptop, rather than 1 and 2.   Looking at /proc/cpuinfo,
I see that processors 1 and 3 have different core IDs, so they sit on
different physical cores.  OK, whatever.
( And it's an Intel system... )


I think there are no shockers here:
  psend spends 44% of its time in pn_post_transfer_frame
  precv spends 67% of its time in pn_dispatch_frame


The code is at https://github.com/mick-goulish/proton_c_clients.git

I will put all this performance info in there too, shortly.

Re: Proton Performance Pictures (1 of 2)

Posted by Alan Conway <ac...@redhat.com>.
On Thu, 2014-09-04 at 02:35 +0200, Leon Mlakar wrote:
> On 04/09/14 01:34, Alan Conway wrote:
> > On Thu, 2014-09-04 at 00:38 +0200, Leon Mlakar wrote:
> >> On 04/09/14 00:25, Alan Conway wrote:
> >>> On Wed, 2014-09-03 at 12:16 -0400, Michael Goulish wrote:
> >>>> OK -- I just had a quick talk with Ted, and this makes sense
> >>>> to me now:
> >>>>
> >>>>     count *receives* per second.
> >>>>
> >>>> I had it turned around and was worried about *sends* per second,
> >>>> and then got confused by issues of fanout.
> >>>>
> >>>> If you only count *receives* per second, and assume no discards,
> >>>> it seems to me that you can indeed make a fair speed comparison
> >>>> between
> >>>>
> >>>>      sender --> receiver
> >>>>
> >>>>      sender --> intermediary --> receiver
> >>>>
> >>>> and
> >>>>
> >>>>      sender --> intermediary --> {receiver_1 ... receiver_n}
> >>>>
> >>>> and even
> >>>>
> >>>>      sender --> {arbitrary network of intermediaries} --> {receiver_1 ... receiver_n}
> >>>>
> >>>> phew.
> >>>>
> >>>>
> >>>> So I will do it that way.
> >>> That's right for throughput, but don't forget latency. A well behaved
> >>> intermediary should have little effect on throughput but will inevitably
> >>> add latency.
> >>>
> >>> Measuring latency between hosts is a pain. You can time-stamp messages
> >>> at the origin host but clock differences can give you bogus numbers if
> >>> you compare that to the time on a different host when the messages
> >>> arrive. One trick is to have the messages arrive back at the same host
> >>> where you time-stamped them (even if they pass thru other hosts in
> >>> between) but that isn't always what you really want to measure. Maybe
> >>> there's something to be done with NTP, I've never dug into that. Have
> >>> fun!
> >>>
> >> To get a reasonably good estimate of the time difference between sender
> >> and receiver, one could exchange several timestamped messages, w/o
> >> intermediary, in both directions and get both sides to agree on the
> >> difference between them. Do that before the test, and then repeat the
> >> exchange at the end of the test to check for the drift. This of course
> >> assumes stable network latencies during these exchanges and is usable
> >> only in test environments. Exchanging several messages instead of just
> >> one should help eliminate sporadic instabilities.
> >>
> > As I understand it that's pretty much what NTP does.
> > http://en.wikipedia.org/wiki/Network_Time_Protocol says that NTP "can
> > achieve better than one millisecond accuracy in local area networks
> > under ideal conditions." That doesn't sound good enough to measure
> > sub-millisecond latencies. I doubt that a home grown attempt at timing
> > message exchanges will do better than NTP :( NTP may deserve further
> > investigation however, Wikipedia probably makes some very broad
> > assumptions about what your "ideal network conditions" are, it's possible
> > that it can be tuned better than that.
> >
> > I can easily get sub-millisecond round-trip latencies out of Qpid with a
> > short message burst:
> > qpidd --auth=no --tcp-nodelay
> > qpid-cpp-benchmark --connection-options '{tcp-nodelay:true}' -q1 -m100
> > send-tp recv-tp l-min   l-max   l-avg   total-tp
> > 38816   30370   0.21    1.18    0.70    3943
> >
> > Sadly if I tell qpidd to use AMQP 1.0 (and therefore proton), things
> > degenerate very badly from a latency perspective.
> > qpid-cpp-benchmark --connection-options '{tcp-nodelay:true,protocol:amqp1.0}' -q1 -m100
> > send-tp recv-tp l-min   l-max   l-avg   total-tp
> > 26086   19552   3.13    6.65    5.28    913
> > 	
> > However this may not be proton's fault, the problem could be in qpidd's
> > AMQP1.0 adapter layer. I'm glad to see that we're starting to measure
> > these things for proton and dispatch, that will surely lead to
> > improvement.
> >
> Yes, you are correct, that's basically what NTP does ... and neither 
> will work well with sub-millisecond ranges. I didn't realize that this 
> is what you are after.

It depends a lot on what kind of system you are measuring, but fine-tuning
the qpid/proton/dispatch tool-set involves some (hopefully!) pretty low
latencies. Even in relatively complicated cases (my past pain was
clustered qpidd for low-latency applications) you may be measuring
3-4 ms, and even there 1 ms error bars are too big.

> 
> There is a beast called 
> http://en.wikipedia.org/wiki/Precision_Time_Protocol, though. A year ago 
> we took a brief look into this but concluded that millisecond accuracy 
> was good enough and that it was not worth the effort.
> 
Thanks, that's interesting!

> And of course, it is also possible to attach a GPS receiver to both 
> sending and receiving host. With decent drivers this should provide at 
> least microsecond accuracy.

That is interesting also! But complicated. This is why I end up just
sending the message back to the host of origin and dividing by 2, even
if it's not really the right answer. I usually only care whether it's better
or worse than the previous build anyway :)

Cheers,
Alan.



Re: Proton Performance Pictures (1 of 2)

Posted by Leon Mlakar <le...@digiverse.si>.
On 04/09/14 01:34, Alan Conway wrote:
> On Thu, 2014-09-04 at 00:38 +0200, Leon Mlakar wrote:
>> On 04/09/14 00:25, Alan Conway wrote:
>>> On Wed, 2014-09-03 at 12:16 -0400, Michael Goulish wrote:
>>>> OK -- I just had a quick talk with Ted, and this makes sense
>>>> to me now:
>>>>
>>>>     count *receives* per second.
>>>>
>>>> I had it turned around and was worried about *sends* per second,
>>>> and then got confused by issues of fanout.
>>>>
>>>> If you only count *receives* per second, and assume no discards,
>>>> it seems to me that you can indeed make a fair speed comparison
>>>> between
>>>>
>>>>      sender --> receiver
>>>>
>>>>      sender --> intermediary --> receiver
>>>>
>>>> and
>>>>
>>>>      sender --> intermediary --> {receiver_1 ... receiver_n}
>>>>
>>>> and even
>>>>
>>>>      sender --> {arbitrary network of intermediaries} --> {receiver_1 ... receiver_n}
>>>>
>>>> phew.
>>>>
>>>>
>>>> So I will do it that way.
>>> That's right for throughput, but don't forget latency. A well behaved
>>> intermediary should have little effect on throughput but will inevitably
>>> add latency.
>>>
>>> Measuring latency between hosts is a pain. You can time-stamp messages
>>> at the origin host but clock differences can give you bogus numbers if
>>> you compare that to the time on a different host when the messages
>>> arrive. One trick is to have the messages arrive back at the same host
>>> where you time-stamped them (even if they pass thru other hosts in
>>> between) but that isn't always what you really want to measure. Maybe
> > there's something to be done with NTP, I've never dug into that. Have
>>> fun!
>>>
>> To get a reasonably good estimate of the time difference between sender
> >> and receiver, one could exchange several timestamped messages, w/o
>> intermediary, in both directions and get both sides to agree on the
>> difference between them. Do that before the test, and then repeat the
>> exchange at the end of the test to check for the drift. This of course
>> assumes stable network latencies during these exchanges and is usable
>> only in test environments. Exchanging several messages instead of just
> >> one should help eliminate sporadic instabilities.
>>
> As I understand it that's pretty much what NTP does.
> http://en.wikipedia.org/wiki/Network_Time_Protocol says that NTP "can
> achieve better than one millisecond accuracy in local area networks
> under ideal conditions." That doesn't sound good enough to measure
> sub-millisecond latencies. I doubt that a home grown attempt at timing
> message exchanges will do better than NTP :( NTP may deserve further
> investigation however, Wikipedia probably makes some very broad
> assumptions about what your "ideal network conditions" are, it's possible
> that it can be tuned better than that.
>
> I can easily get sub-millisecond round-trip latencies out of Qpid with a
> short message burst:
> qpidd --auth=no --tcp-nodelay
> qpid-cpp-benchmark --connection-options '{tcp-nodelay:true}' -q1 -m100
> send-tp recv-tp l-min   l-max   l-avg   total-tp
> 38816   30370   0.21    1.18    0.70    3943
>
> Sadly if I tell qpidd to use AMQP 1.0 (and therefore proton), things
> degenerate very badly from a latency perspective.
> qpid-cpp-benchmark --connection-options '{tcp-nodelay:true,protocol:amqp1.0}' -q1 -m100
> send-tp recv-tp l-min   l-max   l-avg   total-tp
> 26086   19552   3.13    6.65    5.28    913
> 	
> However this may not be proton's fault, the problem could be in qpidd's
> AMQP1.0 adapter layer. I'm glad to see that we're starting to measure
> these things for proton and dispatch, that will surely lead to
> improvement.
>
Yes, you are correct, that's basically what NTP does ... and neither 
will work well with sub-millisecond ranges. I didn't realize that this 
is what you are after.

There is a beast called 
http://en.wikipedia.org/wiki/Precision_Time_Protocol, though. A year ago 
we took a brief look into this but concluded that millisecond accuracy 
was good enough and that it was not worth the effort.

And of course, it is also possible to attach a GPS receiver to both the 
sending and the receiving host. With decent drivers this should provide at 
least microsecond accuracy.

Leon


Re: Proton Performance Pictures (1 of 2)

Posted by Alan Conway <ac...@redhat.com>.
On Thu, 2014-09-04 at 00:38 +0200, Leon Mlakar wrote:
> On 04/09/14 00:25, Alan Conway wrote:
> > On Wed, 2014-09-03 at 12:16 -0400, Michael Goulish wrote:
> >> OK -- I just had a quick talk with Ted, and this makes sense
> >> to me now:
> >>
> >>    count *receives* per second.
> >>
> >> I had it turned around and was worried about *sends* per second,
> >> and then got confused by issues of fanout.
> >>
> >> If you only count *receives* per second, and assume no discards,
> >> it seems to me that you can indeed make a fair speed comparison
> >> between
> >>
> >>     sender --> receiver
> >>
> >>     sender --> intermediary --> receiver
> >>
> >> and
> >>
> >>     sender --> intermediary --> {receiver_1 ... receiver_n}
> >>
> >> and even
> >>
> >>     sender --> {arbitrary network of intermediaries} --> {receiver_1 ... receiver_n}
> >>
> >> phew.
> >>
> >>
> >> So I will do it that way.
> > That's right for throughput, but don't forget latency. A well behaved
> > intermediary should have little effect on throughput but will inevitably
> > add latency.
> >
> > Measuring latency between hosts is a pain. You can time-stamp messages
> > at the origin host but clock differences can give you bogus numbers if
> > you compare that to the time on a different host when the messages
> > arrive. One trick is to have the messages arrive back at the same host
> > where you time-stamped them (even if they pass thru other hosts in
> > between) but that isn't always what you really want to measure. Maybe
> > there's something to be done with NTP, I've never dug into that. Have
> > fun!
> >
> To get a reasonably good estimate of the time difference between sender 
> and receiver, one could exchange several timestamped messages, w/o 
> intermediary, in both directions and get both sides to agree on the 
> difference between them. Do that before the test, and then repeat the 
> exchange at the end of the test to check for the drift. This of course 
> assumes stable network latencies during these exchanges and is usable 
> only in test environments. Exchanging several messages instead of just 
> one should help eliminate sporadic instabilities.
> 

As I understand it, that's pretty much what NTP does.
http://en.wikipedia.org/wiki/Network_Time_Protocol says that NTP "can
achieve better than one millisecond accuracy in local area networks
under ideal conditions." That doesn't sound good enough to measure
sub-millisecond latencies. I doubt that a home-grown attempt at timing
message exchanges will do better than NTP :( NTP may deserve further
investigation, however: Wikipedia probably makes some very broad
assumptions about what your "ideal network conditions" are, and it's possible
that it can be tuned better than that.

I can easily get sub-millisecond round-trip latencies out of Qpid with a
short message burst:
qpidd --auth=no --tcp-nodelay
qpid-cpp-benchmark --connection-options '{tcp-nodelay:true}' -q1 -m100
send-tp recv-tp l-min   l-max   l-avg   total-tp
38816   30370   0.21    1.18    0.70    3943

Sadly if I tell qpidd to use AMQP 1.0 (and therefore proton), things
degenerate very badly from a latency perspective.
qpid-cpp-benchmark --connection-options '{tcp-nodelay:true,protocol:amqp1.0}' -q1 -m100
send-tp recv-tp l-min   l-max   l-avg   total-tp
26086   19552   3.13    6.65    5.28    913
	
However this may not be proton's fault; the problem could be in qpidd's
AMQP 1.0 adapter layer. I'm glad to see that we're starting to measure
these things for proton and dispatch; that will surely lead to
improvement.


Re: Proton Performance Pictures (1 of 2)

Posted by Leon Mlakar <le...@digiverse.si>.
On 04/09/14 00:25, Alan Conway wrote:
> On Wed, 2014-09-03 at 12:16 -0400, Michael Goulish wrote:
>> OK -- I just had a quick talk with Ted, and this makes sense
>> to me now:
>>
>>    count *receives* per second.
>>
>> I had it turned around and was worried about *sends* per second,
>> and then got confused by issues of fanout.
>>
>> If you only count *receives* per second, and assume no discards,
>> it seems to me that you can indeed make a fair speed comparison
>> between
>>
>>     sender --> receiver
>>
>>     sender --> intermediary --> receiver
>>
>> and
>>
>>     sender --> intermediary --> {receiver_1 ... receiver_n}
>>
>> and even
>>
>>     sender --> {arbitrary network of intermediaries} --> {receiver_1 ... receiver_n}
>>
>> phew.
>>
>>
>> So I will do it that way.
> That's right for throughput, but don't forget latency. A well behaved
> intermediary should have little effect on throughput but will inevitably
> add latency.
>
> Measuring latency between hosts is a pain. You can time-stamp messages
> at the origin host but clock differences can give you bogus numbers if
> you compare that to the time on a different host when the messages
> arrive. One trick is to have the messages arrive back at the same host
> where you time-stamped them (even if they pass thru other hosts in
> between) but that isn't always what you really want to measure. Maybe
> there's something to be done with NTP, I've never dug into that. Have
> fun!
>
To get a reasonably good estimate of the time difference between sender 
and receiver, one could exchange several timestamped messages, without an 
intermediary, in both directions and get both sides to agree on the 
difference between them. Do that before the test, and then repeat the 
exchange at the end of the test to check for drift. This of course 
assumes stable network latencies during these exchanges and is usable 
only in test environments. Exchanging several messages instead of just 
one should help eliminate sporadic instabilities.
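
In code, the calculation from one such exchange is the usual NTP-style
one, with four timestamps per round trip (plain doubles here, and the
function names are made up):

    /* t0 = send time at host A (A's clock), t1 = arrival time at host B,
       t2 = reply-send time at B (B's clock), t3 = reply-arrival at A.
       Repeat several times and average, as described above. */
    double clock_offset(double t0, double t1, double t2, double t3) {
        return ((t1 - t0) + (t2 - t3)) / 2.0;  /* how far B's clock is ahead of A's */
    }

    double round_trip_delay(double t0, double t1, double t2, double t3) {
        return (t3 - t0) - (t2 - t1);          /* wire time, excluding B's turnaround */
    }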

Leon


Re: Proton Performance Pictures (1 of 2)

Posted by Alan Conway <ac...@redhat.com>.
On Wed, 2014-09-03 at 12:16 -0400, Michael Goulish wrote:
> OK -- I just had a quick talk with Ted, and this makes sense
> to me now:
> 
>   count *receives* per second.
> 
> I had it turned around and was worried about *sends* per second,
> and then got confused by issues of fanout.
> 
> If you only count *receives* per second, and assume no discards,
> it seems to me that you can indeed make a fair speed comparison 
> between
> 
>    sender --> receiver
> 
>    sender --> intermediary --> receiver
> 
> and
> 
>    sender --> intermediary --> {receiver_1 ... receiver_n}
> 
> and even
> 
>    sender --> {arbitrary network of intermediaries} --> {receiver_1 ... receiver_n}
> 
> phew.
> 
> 
> So I will do it that way.

That's right for throughput, but don't forget latency. A well-behaved
intermediary should have little effect on throughput but will inevitably
add latency.

Measuring latency between hosts is a pain. You can time-stamp messages
at the origin host, but clock differences can give you bogus numbers if
you compare that to the time on a different host when the messages
arrive. One trick is to have the messages arrive back at the same host
where you time-stamped them (even if they pass through other hosts in
between), but that isn't always what you really want to measure. Maybe
there's something to be done with NTP; I've never dug into that. Have
fun!
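
For what it's worth, the round-trip trick boils down to two reads of the
same clock on the origin host, so host-to-host clock differences drop out
entirely. A minimal sketch in C (the function names are made up):

    #include <stdint.h>
    #include <time.h>

    /* Microseconds from the origin host's monotonic clock. */
    static uint64_t now_usec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000 + (uint64_t)ts.tv_nsec / 1000;
    }

    /* Record now_usec() just before sending; call this when the same
       message arrives back, however many hosts it passed through. */
    static double one_way_estimate_usec(uint64_t sent_usec) {
        return (now_usec() - sent_usec) / 2.0;   /* half the round trip */
    }

Halving the round trip assumes the two directions are symmetric, which is
the part that isn't always what you really want to measure.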




Re: Proton Performance Pictures (1 of 2)

Posted by Michael Goulish <mg...@redhat.com>.
OK -- I just had a quick talk with Ted, and this makes sense
to me now:

  count *receives* per second.

I had it turned around and was worried about *sends* per second,
and then got confused by issues of fanout.

If you only count *receives* per second, and assume no discards,
it seems to me that you can indeed make a fair speed comparison 
between

   sender --> receiver

   sender --> intermediary --> receiver

and

   sender --> intermediary --> {receiver_1 ... receiver_n}

and even

   sender --> {arbitrary network of intermediaries} --> {receiver_1 ... receiver_n}

phew.


So I will do it that way.

This is from the application perspective, asking "how fast is your 
messaging system?"  The application doesn't care how fancy the intermediation 
is; it only cares about results.  This seems like the right way to judge that.
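
Measured that way, the receiver side is nothing more than a counter and a
wall clock; a minimal sketch, with the actual receive loop elided and the
helper name made up:

    #include <stdio.h>
    #include <time.h>

    static double wall_seconds(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void) {
        long received = 0;
        double start = wall_seconds();

        /* ... receive loop goes here: ++received for every message that
               arrives, no matter how many intermediaries it crossed ... */

        double elapsed = wall_seconds() - start;
        printf("%ld messages in %.2f s = %.0f receives/sec\n",
               received, elapsed, received / elapsed);
        return 0;
    }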









----- Original Message -----


On 09/03/2014 11:35 AM, Michael Goulish wrote:
> 
> 
> 
> 
> ----- Original Message -----
>> On 09/03/2014 08:51 AM, Michael Goulish wrote:
>>> That point is where I seem to find the best performance on my
>>> system: 123,500 messages per second received.  ( i.e. 247,000
>>> transfers per second ) using about 180% CPU ( i.e. 90% each of
>>> 2 processors. )
>>
>> If you are sending direct between the sender and receiver process (i.e.
>> no intermediary process), then why are you doubling the number of
>> messages sent to get 'transfers per second'? One transfer is the sending
>> of a message from one process to another, which in this case is the same
>> as messages sent or received.
>>
> 
> Yes, this is interesting.
> 
> I need a way to make a fair comparison between something like this setup 
> (simple peer-to-peer) and the Dispatch Router numbers I was getting
> earlier.
> 
> 
> For the router, the analogous topology is    
> 
>     writer --> router --> reader
> 
> in which case I counted each message twice.
> 
> 
> 
> But it does not seem right to count a single message in
>    writer --> router --> reader 
> as "2 transfers", while counting a single message in
>    writer --> reader
> as only "1 transfer".
> 
> Because -- from the application point of view, those two topologies 
> are doing the same work.

You should probably be using "throughput" and not "transfers" in this case.

> 
> 
> 
> Also I think that I *need* to count    writer-->router-->reader   
> as "2", because in *this* case:
> 
> 
>      writer -->  router --> reader_1
>                       \
>                        \--> reader_2
> 
> 
> ...I need to count that as "3" .
> 
> 
> 
> ? Thoughts ?
> 
> 

Re: Proton Performance Pictures (1 of 2)

Posted by Ted Ross <tr...@redhat.com>.

On 09/03/2014 11:35 AM, Michael Goulish wrote:
> 
> 
> 
> 
> ----- Original Message -----
>> On 09/03/2014 08:51 AM, Michael Goulish wrote:
>>> That point is where I seem to find the best performance on my
>>> system: 123,500 messages per second received.  ( i.e. 247,000
>>> transfers per second ) using about 180% CPU ( i.e. 90% each of
>>> 2 processors. )
>>
>> If you are sending direct between the sender and receiver process (i.e.
>> no intermediary process), then why are you doubling the number of
>> messages sent to get 'transfers per second'? One transfer is the sending
>> of a message from one process to another, which in this case is the same
>> as messages sent or received.
>>
> 
> Yes, this is interesting.
> 
> I need a way to make a fair comparison between something like this setup 
> (simple peer-to-peer) and the Dispatch Router numbers I was getting
> earlier.
> 
> 
> For the router, the analogous topology is    
> 
>     writer --> router --> reader
> 
> in which case I counted each message twice.
> 
> 
> 
> But it does not seem right to count a single message in
>    writer --> router --> reader 
> as "2 transfers", while counting a single message in
>    writer --> reader
> as only "1 transfer".
> 
> Because -- from the application point of view, those two topologies 
> are doing the same work.

You should probably be using "throughput" and not "transfers" in this case.

> 
> 
> 
> Also I think that I *need* to count    writer-->router-->reader   
> as "2", because in *this* case:
> 
> 
>      writer -->  router --> reader_1
>                       \
>                        \--> reader_2
> 
> 
> ...I need to count that as "3" .
> 
> 
> 
> ? Thoughts ?
> 
> 

Re: Proton Performance Pictures (1 of 2)

Posted by Michael Goulish <mg...@redhat.com>.



----- Original Message -----
> On 09/03/2014 08:51 AM, Michael Goulish wrote:
> > That point is where I seem to find the best performance on my
> > system: 123,500 messages per second received.  ( i.e. 247,000
> > transfers per second ) using about 180% CPU ( i.e. 90% each of
> > 2 processors. )
> 
> If you are sending direct between the sender and receiver process (i.e.
> no intermediary process), then why are you doubling the number of
> messages sent to get 'transfers per second'? One transfer is the sending
> of a message from one process to another, which in this case is the same
> as messages sent or received.
> 

Yes, this is interesting.

I need a way to make a fair comparison between something like this setup 
(simple peer-to-peer) and the Dispatch Router numbers I was getting
earlier.


For the router, the analogous topology is    

    writer --> router --> reader

in which case I counted each message twice.



But it does not seem right to count a single message in
   writer --> router --> reader 
as "2 transfers", while counting a single message in
   writer --> reader
as only "1 transfer".

Because -- from the application point of view, those two topologies 
are doing the same work.



Also I think that I *need* to count    writer-->router-->reader   
as "2", because in *this* case:


     writer -->  router --> reader_1
                      \
                       \--> reader_2


...I need to count that as "3" .
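
In other words, the rule I'm proposing amounts to this, for a simple chain
that fans out only at the last hop (the function is just for the sake of
argument):

    /* Each delivery of a message over a link is one "transfer", so a
       message that passes through I intermediaries and fans out to F
       final readers counts as I + F transfers:
         writer --> reader                            : 0 + 1 = 1
         writer --> router --> reader                 : 1 + 1 = 2
         writer --> router --> {reader_1, reader_2}   : 1 + 2 = 3   */
    int transfers_per_message(int intermediaries, int fanout) {
        return intermediaries + fanout;
    }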



? Thoughts ?



Re: Proton Performance Pictures (1 of 2)

Posted by Gordon Sim <gs...@redhat.com>.
On 09/03/2014 08:51 AM, Michael Goulish wrote:
> That point is where I seem to find the best performance on my
> system: 123,500 messages per second received.  ( i.e. 247,000
> transfers per second ) using about 180% CPU ( i.e. 90% each of
> 2 processors. )

If you are sending directly between the sender and receiver processes (i.e. 
no intermediary process), then why are you doubling the number of 
messages sent to get 'transfers per second'? One transfer is the sending 
of a message from one process to another, which in this case is the same 
as the number of messages sent or received.