Posted to users@kafka.apache.org by Yuheng Du <yu...@gmail.com> on 2015/09/04 04:45:09 UTC

latency test

I am running a producer latency test. When using 92 producers on 92
physical nodes publishing to 4 brokers, the latency is slightly lower than
when using 8 brokers. I am using 8 partitions for the topic.

I have rerun the test and it gives me the same result: the 4-broker
scenario still has lower latency than the 8-broker scenario.

It is weird because I tested 1 broker, 2 brokers, 4 brokers, 8 brokers, 16
brokers, and 32 brokers. In all the other cases the latency decreases as
the number of brokers increases.

4 brokers/8 brokers is the only pair that doesn't satisfy this rule. What
could be the cause?

I am using 200-byte messages, and the test has each producer publish 500k
messages to a given topic. For every test run where I change the number of
brokers, I use a new topic.

Thanks for any advice.
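
A minimal sketch of a harness for this kind of test, assuming the Java
producer client; the broker address and topic name below are placeholders:

    import java.util.Properties;
    import java.util.Random;
    import java.util.concurrent.atomic.AtomicLong;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // Sends 500k 200-byte messages and measures send-to-ack latency.
    public class AckLatencyTest {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder
            props.put("acks", "1"); // leader-only ack; "all" waits on replicas
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.ByteArraySerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.ByteArraySerializer");

            byte[] payload = new byte[200];          // 200-byte message body
            new Random().nextBytes(payload);
            final AtomicLong maxAckMs = new AtomicLong();

            try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
                for (int i = 0; i < 500_000; i++) {  // 500k messages per producer
                    final long start = System.nanoTime();
                    // The callback fires when the broker acknowledges the write.
                    producer.send(new ProducerRecord<>("latency-test", payload),
                            (metadata, exception) -> {
                        long ms = (System.nanoTime() - start) / 1_000_000;
                        maxAckMs.getAndAccumulate(ms, Math::max);
                    });
                }
            } // close() flushes all outstanding sends
            System.out.println("max ack latency (ms): " + maxAckMs.get());
        }
    }
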

Re: latency test

Posted by Yuheng Du <yu...@gmail.com>.
Thank you Erik.

In my test I am using fixed 200-byte messages, and I run 500k messages per
producer on 92 physically isolated producers. Each test run takes about 20
minutes. As the broker cluster is being migrated to a new physical cluster,
I will rerun my test and get the latency results in the next couple of
weeks.

I will keep you posted.

Thanks.

On Wed, Sep 9, 2015 at 4:58 PM, Helleren, Erik <Er...@cmegroup.com>
wrote:

> Yes, and that can really hurt average performance.  All the partitions
> were nearly identical up to the 99%'ile, and had very good performance at
> that level, hovering around a few millis.  But when looking beyond the
> 99%'ile, there was that clear fork in the distribution where a set of 3
> partitions surged upwards.  This could be for a dozen different reasons:
> network blips, noisy networks, location in the network, resource
> contention on that broker, etc.  But it affected that one broker more than
> the others.  And the reasons for my cluster displaying this behavior could
> be very different from the reasons for any other cluster.
>
> It's worth noting that this was mostly a latency test rather than a
> stress test.  There was a single Kafka producer object, very small message
> sizes (100 bytes), and it was only pushing through around 5MB/s worth of
> data. And the client was configured to minimize the amount of data that
> would be on the internal queue/buffer waiting to be sent.  The messages
> being sent were composed of 10-byte ASCII 'words' selected randomly from a
> dictionary of 1000 words, which benefits compression while still resulting
> in likely unique messages.  And the test I ran was only for 6 min, and I
> did not do the work required to see if there was a burst of slower
> messages which caused this behavior, or if it was a consistent issue with
> that node.
> -Erik
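
A sketch of that payload scheme; the dictionary here is synthesized rather
than loaded from a real word list:

    import java.nio.charset.StandardCharsets;
    import java.util.Random;

    // Message bodies built from random 10-byte ASCII "words" drawn from a
    // fixed 1000-word dictionary: compressible, yet likely unique messages.
    public class WordPayload {
        private static final Random RANDOM = new Random();
        private static final byte[][] DICTIONARY = new byte[1000][];
        static {
            for (int i = 0; i < DICTIONARY.length; i++) {
                // Synthesized 10-character words; a real word list works too.
                DICTIONARY[i] = String.format("word%06d", i)
                        .getBytes(StandardCharsets.US_ASCII);
            }
        }

        // e.g. payload(10) gives the 100-byte messages described above.
        static byte[] payload(int words) {
            byte[] body = new byte[words * 10];
            for (int w = 0; w < words; w++) {
                System.arraycopy(DICTIONARY[RANDOM.nextInt(DICTIONARY.length)],
                        0, body, w * 10, 10);
            }
            return body;
        }
    }
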
>
>
> On 9/9/15, 2:24 PM, "Yuheng Du" <yu...@gmail.com> wrote:
>
> >So are you suggesting that the long delays, the slowest 1 percent,
> >happen on the slower partitions that are further away? Thanks.
> >
> >On Wed, Sep 9, 2015 at 3:15 PM, Helleren, Erik
> ><Er...@cmegroup.com>
> >wrote:
> >
> >> So, I did my own latency test on a cluster of 3 nodes, and there is a
> >> significant difference around the 99%'ile and higher for partitions
> >> when measuring the ack time when configured for a single ack.  The
> >> graph that I wish I could attach or post clearly shows that around 1/3
> >> of the partitions significantly diverge from the other two.  So, at
> >> least in my case, one of my brokers is further away than the others.
> >> -Erik
> >>
> >> On 9/4/15, 1:06 PM, "Yuheng Du" <yu...@gmail.com> wrote:
> >>
> >> >No problem. Thanks for your advice. I think it would be fun to
> >>explore. I
> >> >only know how to program in java though. Hope it will work.
> >> >
> >> >On Fri, Sep 4, 2015 at 2:03 PM, Helleren, Erik
> >> ><Er...@cmegroup.com>
> >> >wrote:
> >> >
> >> >> I think the suggestion is to have partitions/brokers >= 1, so 32
> >> >> should be enough.
> >> >>
> >> >> As for latency tests, there isn't a lot of code to do a latency
> >> >> test.  If you just want to measure ack time, it's around 100 lines.
> >> >> I will try to push out some good latency testing code to GitHub, but
> >> >> my company is scared of open sourcing code… so it might be a while…
> >> >> -Erik
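
The core of such an ack-time measurement is indeed small; a sketch of one
method for a test class, using the Java producer API, with the topic name
as a placeholder:

    import java.util.concurrent.ExecutionException;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // One synchronous sample: block on the Future returned by send(), so
    // the elapsed time is a full send-to-ack round trip for one message.
    static long ackMicros(Producer<byte[], byte[]> producer, byte[] payload)
            throws InterruptedException, ExecutionException {
        long start = System.nanoTime();
        producer.send(new ProducerRecord<>("latency-test", payload)).get();
        return (System.nanoTime() - start) / 1_000;
    }
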
> >> >>
> >> >>
> >> >> On 9/4/15, 12:55 PM, "Yuheng Du" <yu...@gmail.com> wrote:
> >> >>
> >> >> >Thanks for your reply Erik. I am running some more tests according
> >> >> >to your suggestions now and I will share my results here. Is it
> >> >> >necessary to use a fixed number of partitions (32 partitions maybe)
> >> >> >for my test?
> >> >> >
> >> >> >I am testing 2, 4, 8, 16 and 32 broker scenarios, all of them
> >> >> >running on individual physical nodes. So I think using at least 32
> >> >> >partitions will make more sense? I have seen latencies increase as
> >> >> >the number of partitions goes up in my experiments.
> >> >> >
> >> >> >To get the latency of each event recorded, are you suggesting that
> >> >> >I write my own test program (in Java perhaps), or can I just modify
> >> >> >the standard test program provided by kafka (
> >> >> >https://gist.github.com/jkreps/c7ddb4041ef62a900e6c )? I guess I
> >> >> >need to rebuild the source if I modify the standard Java test
> >> >> >program ProducerPerformance provided in kafka, right? Now this
> >> >> >standard program only has average and percentile latencies but no
> >> >> >per-event latencies.
> >> >> >
> >> >> >Thanks.
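
One way to record per-event latencies in a modified test program without
doing file I/O on the producer's callback thread (a concern Erik raises
below) is to enqueue each sample and write the CSV from a separate thread;
a sketch, with illustrative names:

    import java.io.PrintWriter;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Collects (send timestamp, latency) samples from the producer callback
    // and writes them as CSV on a dedicated thread.
    public class PerEventLog implements AutoCloseable {
        private final BlockingQueue<long[]> samples = new LinkedBlockingQueue<>();
        private final Thread writer;

        public PerEventLog(final PrintWriter out) {
            writer = new Thread(() -> {
                try {
                    while (true) {
                        long[] s = samples.take(); // s[0]=send ns, s[1]=latency ns
                        out.println(s[0] + "," + s[1]);
                    }
                } catch (InterruptedException e) {
                    out.flush();                   // interrupted: flush and exit
                }
            });
            writer.setDaemon(true);
            writer.start();
        }

        // Called from the kafka send callback; offer() does not block here.
        public void record(long sendNanos, long latencyNanos) {
            samples.offer(new long[] { sendNanos, latencyNanos });
        }

        @Override
        public void close() {
            writer.interrupt();
        }
    }
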
> >> >> >
> >> >> >On Fri, Sep 4, 2015 at 1:42 PM, Helleren, Erik
> >> >> ><Er...@cmegroup.com>
> >> >> >wrote:
> >> >> >
> >> >> >> That is an excellent question!  There are a bunch of ways to
> >> >> >> monitor jitter and see when it is happening.  Here are a few:
> >> >> >>
> >> >> >> - You could slice the histogram every few seconds, save it out
> >> >> >> with a timestamp, and then look at how the slices compare.  This
> >> >> >> would be mostly manual, or you can graph line charts of the
> >> >> >> percentiles over time in Excel, where each percentile would be a
> >> >> >> series.  If you are using HDR Histogram, you should look at how
> >> >> >> to use the Recorder class to do this, coupled with a
> >> >> >> ScheduledExecutorService (see the sketch after this list).
> >> >> >>
> >> >> >> - You can just save the starting timestamp and the latency of
> >> >> >> each event.  If you put it into a CSV, you can just load it up
> >> >> >> into Excel and graph it as an XY chart.  That way you can see
> >> >> >> every point during the run of your program and you can see
> >> >> >> trends.  You want to be careful about this one, especially about
> >> >> >> writing to a file in the callback that kafka provides.
> >> >> >>
> >> >> >> Also, I have noticed that most of the very slow observations are
> >> >> >> at startup.  But don't trust me, trust the data and share your
> >> >> >> findings.  Also, a 99.9 percentile provides a pretty good standard
> >> >> >> for typical poor-case performance.  Average is borderline useless;
> >> >> >> the 50%'ile is a better typical case because that's the number
> >> >> >> that says "half of events will be this slow or faster", and for
> >> >> >> high values like the 99.9%'ile, "0.1% of all events will be slower
> >> >> >> than this".
> >> >> >> -Erik
> >> >> >>
> >> >> >> On 9/4/15, 12:05 PM, "Yuheng Du" <yu...@gmail.com> wrote:
> >> >> >>
> >> >> >> >Thank you Erik! That is helpful!
> >> >> >> >
> >> >> >> >But I also see jitter in the maximum latencies when running the
> >> >> >> >experiment.
> >> >> >> >
> >> >> >> >The average send-to-acknowledgement latency from producer to
> >> >> >> >broker is around 5ms when using 92 producers and 4 brokers, and
> >> >> >> >the 99.9 percentile latency is 58ms, but the maximum latency goes
> >> >> >> >up to 1359 ms. How can I locate the source of this jitter?
> >> >> >> >
> >> >> >> >Thanks.
> >> >> >> >
> >> >> >> >On Fri, Sep 4, 2015 at 10:54 AM, Helleren, Erik
> >> >> >> ><Er...@cmegroup.com>
> >> >> >> >wrote:
> >> >> >> >
> >> >> >> >> Well… not to be contrarian, but latency depends much more on
> >> >> >> >> the latency between the producer and the broker that is the
> >> >> >> >> leader for the partition you are publishing to.  At least when
> >> >> >> >> your brokers are not saturated with messages, and acks is set
> >> >> >> >> to 1.  If acks is set to ALL, latency on a non-saturated kafka
> >> >> >> >> cluster will be: the round-trip latency from the producer to
> >> >> >> >> the leader for the partition, plus the slowest round-trip
> >> >> >> >> latency from the leader to a replica of that partition.  If a
> >> >> >> >> cluster is saturated with messages, we have to assume that all
> >> >> >> >> partitions receive an equal distribution of messages to avoid
> >> >> >> >> linear algebra and queueing theory models.  I don't like
> >> >> >> >> linear algebra :P
> >> >> >> >>
> >> >> >> >> Since you are probably putting all your latencies into a
> >> >> >> >> single histogram per producer, or worse, just an average, this
> >> >> >> >> pattern would have been obscured.  Obligatory lecture about
> >> >> >> >> measuring latency by Gil Tene
> >> >> >> >> (https://www.youtube.com/watch?v=9MKY4KypBzg).  To verify this
> >> >> >> >> hypothesis, you should re-write the benchmark to record the
> >> >> >> >> latencies for each write to a partition, for each producer,
> >> >> >> >> into a histogram (HDR Histogram is pretty good for that).
> >> >> >> >> This would give you producers*partitions histograms, which
> >> >> >> >> might be unwieldy for that many producers. But wait, there is
> >> >> >> >> hope!
> >> >> >> >>
> >> >> >> >> To verify that this hypothesis holds, you just have to see
> >> >> >> >> that there is a significant difference between different
> >> >> >> >> partitions on a SINGLE producing client. So, pick one
> >> >> >> >> producing client at random and use the data from that. The
> >> >> >> >> easy way to do that is to just plot all the partition latency
> >> >> >> >> histograms on top of each other in the same plot; that way you
> >> >> >> >> have a pretty plot to show people.  If you don't want to set
> >> >> >> >> up plotting, you can just compare the medians (50th
> >> >> >> >> percentile) of the partitions' histograms.  If there is a lot
> >> >> >> >> of variance, your latency anomaly is explained by brokers 4-7
> >> >> >> >> being slower than nodes 0-3!  If there isn't a lot of variance
> >> >> >> >> at 50%, look at higher percentiles.  And if the higher
> >> >> >> >> percentiles for all the partitions look the same, this
> >> >> >> >> hypothesis is disproved.
> >> >> >> >>
> >> >> >> >> If you want to make a general statement about the latency of
> >> >> >> >> writing to kafka, you can merge all the histograms into a
> >> >> >> >> single histogram and plot that.
> >> >> >> >>
> >> >> >> >> To Yuheng's credit, more brokers always result in more
> >> >> >> >> throughput.  But throughput and latency are two different
> >> >> >> >> creatures.  It's worth noting that kafka is designed to be
> >> >> >> >> high throughput first and low latency second.  And it does a
> >> >> >> >> really good job at both.
> >> >> >> >>
> >> >> >> >> Disclaimer: I might not like linear algebra, but I do like
> >> >> >> >> statistics.  Let me know if there are topics that need more
> >> >> >> >> explanation above that aren't covered by Gil's lecture.
> >> >> >> >> -Erik
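
A sketch of the per-partition comparison on a single producing client,
again assuming HdrHistogram; the send callback supplies
metadata.partition() and the measured latency:

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import org.HdrHistogram.Histogram;

    // One histogram per partition on a single producing client.  Comparing
    // the partitions' medians (then higher percentiles) tests the
    // hypothesis that some brokers are slower or further away than others.
    public class PartitionLatencies {
        private final ConcurrentMap<Integer, Histogram> byPartition =
                new ConcurrentHashMap<>();

        // Call from the send callback with metadata.partition().  A single
        // producer invokes callbacks from one I/O thread, so plain
        // histograms are safe to record into here.
        public void record(int partition, long latencyMicros) {
            byPartition.computeIfAbsent(partition,
                    p -> new Histogram(3_600_000_000L, 3)) // up to 1h, micros
                    .recordValue(latencyMicros);
        }

        public void report() {
            byPartition.forEach((p, h) -> System.out.printf(
                    "partition %d: p50=%d p99=%d p99.9=%d (n=%d)%n",
                    p,
                    h.getValueAtPercentile(50.0),
                    h.getValueAtPercentile(99.0),
                    h.getValueAtPercentile(99.9),
                    h.getTotalCount()));
        }
    }
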
> >> >> >> >>
> >> >> >> >> On 9/4/15, 9:03 AM, "Yuheng Du" <yu...@gmail.com> wrote:
> >> >> >> >>
> >> >> >> >> >When I use 32 partitions, the 4-broker latency becomes
> >> >> >> >> >larger than the 8-broker latency.
> >> >> >> >> >
> >> >> >> >> >So is it always true that using more brokers gives lower
> >> >> >> >> >latency when the number of partitions is at least the number
> >> >> >> >> >of brokers?
> >> >> >> >> >
> >> >> >> >> >Thanks.

Re: latency test

Posted by "Helleren, Erik" <Er...@cmegroup.com>.
Yes, and that can really hurt average performance.  All the partitions
were nearly identical up to the 99%’ile, and had very good performance at
that level hovering around a few milli’s.  But when looking beyond the
99%’ile, there was that clear fork in the distribution where a set of 3
partitions surged upwards.  This could be for a dozen different reasons:
Network blips, noisy networks, location in the network, resource
contention on that broker, etc.  But it effected that one broker more than
others.  And the reasons for my cluster displaying this behavior could be
very different than the reason for any other cluster.

Its worth noting that this was mostly a latency test over a stress test.
There was a single kafka producer object, very small message sizes (100
bytes), and it was only pushing through around 5MB/s worth of data. And
the client was configured to minimize the amount of data that would be on
the internal queue/buffer waiting to be sent.  The messages that were
being sent were compromised of 10 byte ascii ‘words’ selected randomly
from a dictionary of 1000 words, which benefits compression while still
resulting in likely unique messages.  And the test I ran was only for 6
min, and I did not do the work required to see if there was a burst of
slower messages which caused this behavior, or if it was a consistent
issue with that node.
-Erik


On 9/9/15, 2:24 PM, "Yuheng Du" <yu...@gmail.com> wrote:

>So are you suggesting that the long delays happened in %1 percentile
>happens in the slower partitions that are further away? Thanks.
>
>On Wed, Sep 9, 2015 at 3:15 PM, Helleren, Erik
><Er...@cmegroup.com>
>wrote:
>
>> So, I did my own latency test on a cluster of 3 nodes, and there is a
>> significant difference around the 99%’ile and higher for partitions when
>> measuring the the ack time when configured for a single ack.  The graph
>> that I wish I could attach or post clearly shows that around 1/3 of the
>> partitions significantly diverge from the other two.  So, at least in my
>> case, one of my brokers is further than the others.
>> -Erik
>>
>> On 9/4/15, 1:06 PM, "Yuheng Du" <yu...@gmail.com> wrote:
>>
>> >No problem. Thanks for your advice. I think it would be fun to
>>explore. I
>> >only know how to program in java though. Hope it will work.
>> >
>> >On Fri, Sep 4, 2015 at 2:03 PM, Helleren, Erik
>> ><Er...@cmegroup.com>
>> >wrote:
>> >
>> >> I thing the suggestion is to have partitions/brokers >=1, so 32
>>should
>> >>be
>> >> enough.
>> >>
>> >> As for latency tests, there isn’t a lot of code to do a latency test.
>> >>If
>> >> you just want to measure ack time its around 100 lines.  I will try
>>to
>> >> push out some good latency testing code to github, but my company is
>> >> scared of open sourcing code… so it might be a while…
>> >> -Erik
>> >>
>> >>
>> >> On 9/4/15, 12:55 PM, "Yuheng Du" <yu...@gmail.com> wrote:
>> >>
>> >> >Thanks for your reply Erik. I am running some more tests according
>>to
>> >>your
>> >> >suggestions now and I will share with my results here. Is it
>>necessary
>> >>to
>> >> >use a fixed number of partitions (32 partitions maybe) for my test?
>> >> >
>> >> >I am testing 2, 4, 8, 16 and 32 brokers scenarios, all of them are
>> >>running
>> >> >on individual physical nodes. So I think using at least 32
>>partitions
>> >>will
>> >> >make more sense? I have seen latencies increase as the number of
>> >> >partitions
>> >> >goes up in my experiments.
>> >> >
>> >> >To get the latency of each event data recorded, are you suggesting
>> >>that I
>> >> >rewrite my own test program (in Java perhaps) or I can just modify
>>the
>> >> >standard test program provided by kafka (
>> >> >https://gist.github.com/jkreps/c7ddb4041ef62a900e6c )? I guess I
>>need
>> >>to
>> >> >rebuild the source if I modify the standard java test program
>> >> >ProducerPerformance provided in kafka, right? Now this standard
>>program
>> >> >only has average latencies and percentile latencies but no per event
>> >> >latencies.
>> >> >
>> >> >Thanks.
>> >> >
>> >> >On Fri, Sep 4, 2015 at 1:42 PM, Helleren, Erik
>> >> ><Er...@cmegroup.com>
>> >> >wrote:
>> >> >
>> >> >> That is an excellent question!  There are a bunch of ways to
>>monitor
>> >> >> jitter and see when that is happening.  Here are a few:
>> >> >>
>> >> >> - You could slice the histogram every few seconds, save it out
>>with a
>> >> >> timestamp, and then look at how they compare.  This would be
>>mostly
>> >> >> manual, or you can graph line charts of the percentiles over time
>>in
>> >> >>excel
>> >> >> where each percentile would be a series.  If you are using HDR
>> >> >>Histogram,
>> >> >> you should look at how to use the Recorder class to do this
>>coupled
>> >> >>with a
>> >> >> ScheduledExecutorService.
>> >> >>
>> >> >> - You can just save the starting timestamp of the event and the
>> >>latency
>> >> >>of
>> >> >> each event.  If you put it into a CSV, you can just load it up
>>into
>> >> >>excel
>> >> >> and graph as a XY chart.  That way you can see every point during
>>the
>> >> >> running of your program and you can see trends.  You want to be
>> >>careful
>> >> >> about this one, especially of writing to a file in the callback
>>that
>> >> >>kfaka
>> >> >> provides.
>> >> >>
>> >> >> Also, I have noticed that most of the very slow observations are
>>at
>> >> >> startup.  But don’t trust me, trust the data and share your
>>findings.
>> >> >> Also, having a 99.9 percentile provides a pretty good standard for
>> >> >>typical
>> >> >> poor case performance.  Average is borderline useless, 50%’ile is
>>a
>> >> >>better
>> >> >> typical case because that’s the number that says “half of events
>> >>will be
>> >> >> this slow or faster”, or for values that are high like 99.9%’ile,
>> >>“0.1%
>> >> >>of
>> >> >> all events will be slower than this”.
>> >> >> -Erik
>> >> >>
>> >> >> On 9/4/15, 12:05 PM, "Yuheng Du" <yu...@gmail.com> wrote:
>> >> >>
>> >> >> >Thank you Erik! That's is helpful!
>> >> >> >
>> >> >> >But also I see jitters of the maximum latencies when running the
>> >> >> >experiment.
>> >> >> >
>> >> >> >The average end to acknowledgement latency from producer to
>>broker
>> >>is
>> >> >> >around 5ms when using 92 producers and 4 brokers, and the 99.9
>> >> >>percentile
>> >> >> >latency is 58ms, but the maximum latency goes up to 1359 ms. How
>>to
>> >> >>locate
>> >> >> >the source of this jitter?
>> >> >> >
>> >> >> >Thanks.
>> >> >> >
>> >> >> >On Fri, Sep 4, 2015 at 10:54 AM, Helleren, Erik
>> >> >> ><Er...@cmegroup.com>
>> >> >> >wrote:
>> >> >> >
>> >> >> >> WellŠ not to be contrarian, but latency depends much more on
>>the
>> >> >>latency
>> >> >> >> between the producer and the broker that is the leader for the
>> >> >>partition
>> >> >> >> you are publishing to.  At least when your brokers are not
>> >>saturated
>> >> >> >>with
>> >> >> >> messages, and acks are set to 1.  If acks are set to ALL,
>>latency
>> >>on
>> >> >>an
>> >> >> >> non-saturated kafka cluster will be: Round Trip Latency from
>> >> >>producer to
>> >> >> >> leader for partition + Max( slowest Round Trip Latency to a
>> >>replicas
>> >> >>of
>> >> >> >> that partition).  If a cluster is saturated with messages, we
>> >>have to
>> >> >> >> assume that all partitions receive an equal distribution of
>> >>messages
>> >> >>to
>> >> >> >> avoid linear algebra and queueing theory models.  I don¹t like
>> >>linear
>> >> >> >> algebra :P
>> >> >> >>
>> >> >> >> Since you are probably putting all your latencies into a single
>> >> >> >>histogram
>> >> >> >> per producer, or worse, just an average, this pattern would
>>have
>> >>been
>> >> >> >> obscured.  Obligatory lecture about measuring latency by Gil
>>Tene
>> >> >> >> (https://www.youtube.com/watch?v=9MKY4KypBzg).  To verify this
>> >> >> >>hypothesis,
>> >> >> >> you should re-write the benchmark to plot the latencies for
>>each
>> >> >>write
>> >> >> >>to
>> >> >> >> a partition for each producer into a histogram. (HRD histogram
>>is
>> >> >>pretty
>> >> >> >> good for that).  This would give you producers*partitions
>> >>histograms,
>> >> >> >> which might be unwieldy for that many producers. But wait,
>>there
>> >>is
>> >> >> >>hope!
>> >> >> >>
>> >> >> >> To verify that this hypothesis holds, you just have to see that
>> >>there
>> >> >> >>is a
>> >> >> >> significant difference between different partitions on a SINGLE
>> >> >> >>producing
>> >> >> >> client. So, pick one producing client at random and use the
>>data
>> >>from
>> >> >> >> that. The easy way to do that is just plot all the partition
>> >>latency
>> >> >> >> histograms on top of each other in the same plot, that way you
>> >>have a
>> >> >> >> pretty plot to show people.  If you don¹t want to setup
>>plotting,
>> >>you
>> >> >> >>can
>> >> >> >> just compare the medians (50¹th percentile) of the partitions¹
>> >> >> >>histograms.
>> >> >> >>  If there is a lot of variance, your latency anomaly is
>>explained
>> >>by
>> >> >> >> brokers 4-7 being slower than nodes 0-3!  If there isn¹t a lot
>>of
>> >> >> >>variance
>> >> >> >> at 50%, look at higher percentiles.  And if higher percentiles
>>for
>> >> >>all
>> >> >> >>the
>> >> >> >> partitions look the same, this hypothesis is disproved.
>> >> >> >>
>> >> >> >> If you want to make a general statement about latency of
>>writing
>> >>to
>> >> >> >>kafka,
>> >> >> >> you can merge all the histograms into a single histogram and
>>plot
>> >> >>that.
>> >> >> >>
>> >> >> >> To Yuheng¹s credit, more brokers always results in more
>> >>throughput.
>> >> >>But
>> >> >> >> throughput and latency are two different creatures.  Its worth
>> >>noting
>> >> >> >>that
>> >> >> >> kafka is designed to be high throughput first and low latency
>> >>second.
>> >> >> >>And
>> >> >> >> it does a really good job at both.
>> >> >> >>
>> >> >> >> Disclaimer: I might not like linear algebra, but I do like
>> >> >>statistics.
>> >> >> >> Let me know if there are topics that need more explanation
>>above
>> >>that
>> >> >> >> aren¹t covered by Gil¹s lecture.
>> >> >> >> -Erik
>> >> >> >>
>> >> >> >> On 9/4/15, 9:03 AM, "Yuheng Du" <yu...@gmail.com>
>>wrote:
>> >> >> >>
>> >> >> >> >When I using 32 partitions, the 4 brokers latency becomes
>>larger
>> >> >>than
>> >> >> >>the
>> >> >> >> >8
>> >> >> >> >brokers latency.
>> >> >> >> >
>> >> >> >> >So is it always true that using more brokers can give less
>> >>latency
>> >> >>when
>> >> >> >> >the
>> >> >> >> >number of partitions is at least the size of the brokers?
>> >> >> >> >
>> >> >> >> >Thanks.
>> >> >> >> >
>> >> >> >> >On Thu, Sep 3, 2015 at 10:45 PM, Yuheng Du
>> >> >><yu...@gmail.com>
>> >> >> >> >wrote:
>> >> >> >> >
>> >> >> >> >> I am running a producer latency test. When using 92
>>producers
>> >>in
>> >> >>92
>> >> >> >> >> physical node publishing to 4 brokers, the latency is
>>slightly
>> >> >>lower
>> >> >> >> >>than
>> >> >> >> >> using 8 brokers, I am using 8 partitions for the topic.
>> >> >> >> >>
>> >> >> >> >> I have rerun the test and it gives me the same result, the 4
>> >> >>brokers
>> >> >> >> >> scenario still has lower latency than the 8 brokers
>>scenarios.
>> >> >> >> >>
>> >> >> >> >> It is weird because I tested 1broker, 2 brokers, 4 brokers,
>>8
>> >> >> >>brokers,
>> >> >> >> >>16
>> >> >> >> >> brokers and 32 brokers. For the rest of the case the latency
>> >> >> >>decreases
>> >> >> >> >>as
>> >> >> >> >> the number of brokers increase.
>> >> >> >> >>
>> >> >> >> >> 4 brokers/8 brokers is the only pair that doesn't satisfy
>>this
>> >> >>rule.
>> >> >> >> >>What
>> >> >> >> >> could be the cause?
>> >> >> >> >>
>> >> >> >> >> I am using a 200 bytes message, the test let each producer
>> >> >>publishes
>> >> >> >> >>500k
>> >> >> >> >> messages to a given topic. Every test run when I change the
>> >> >>number of
>> >> >> >> >> brokers, I use a new topic.
>> >> >> >> >>
>> >> >> >> >> Thanks for any advices.
>> >> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >>
>> >> >>
>> >>
>> >>
>>
>>


Re: latency test

Posted by Yuheng Du <yu...@gmail.com>.
So are you suggesting that the long delays happened in %1 percentile
happens in the slower partitions that are further away? Thanks.

On Wed, Sep 9, 2015 at 3:15 PM, Helleren, Erik <Er...@cmegroup.com>
wrote:

> So, I did my own latency test on a cluster of 3 nodes, and there is a
> significant difference around the 99%’ile and higher for partitions when
> measuring the the ack time when configured for a single ack.  The graph
> that I wish I could attach or post clearly shows that around 1/3 of the
> partitions significantly diverge from the other two.  So, at least in my
> case, one of my brokers is further than the others.
> -Erik
>
> On 9/4/15, 1:06 PM, "Yuheng Du" <yu...@gmail.com> wrote:
>
> >No problem. Thanks for your advice. I think it would be fun to explore. I
> >only know how to program in java though. Hope it will work.
> >
> >On Fri, Sep 4, 2015 at 2:03 PM, Helleren, Erik
> ><Er...@cmegroup.com>
> >wrote:
> >
> >> I thing the suggestion is to have partitions/brokers >=1, so 32 should
> >>be
> >> enough.
> >>
> >> As for latency tests, there isn’t a lot of code to do a latency test.
> >>If
> >> you just want to measure ack time its around 100 lines.  I will try to
> >> push out some good latency testing code to github, but my company is
> >> scared of open sourcing code… so it might be a while…
> >> -Erik
> >>
> >>
> >> On 9/4/15, 12:55 PM, "Yuheng Du" <yu...@gmail.com> wrote:
> >>
> >> >Thanks for your reply Erik. I am running some more tests according to
> >>your
> >> >suggestions now and I will share with my results here. Is it necessary
> >>to
> >> >use a fixed number of partitions (32 partitions maybe) for my test?
> >> >
> >> >I am testing 2, 4, 8, 16 and 32 brokers scenarios, all of them are
> >>running
> >> >on individual physical nodes. So I think using at least 32 partitions
> >>will
> >> >make more sense? I have seen latencies increase as the number of
> >> >partitions
> >> >goes up in my experiments.
> >> >
> >> >To get the latency of each event data recorded, are you suggesting
> >>that I
> >> >rewrite my own test program (in Java perhaps) or I can just modify the
> >> >standard test program provided by kafka (
> >> >https://gist.github.com/jkreps/c7ddb4041ef62a900e6c )? I guess I need
> >>to
> >> >rebuild the source if I modify the standard java test program
> >> >ProducerPerformance provided in kafka, right? Now this standard program
> >> >only has average latencies and percentile latencies but no per event
> >> >latencies.
> >> >
> >> >Thanks.
> >> >
> >> >On Fri, Sep 4, 2015 at 1:42 PM, Helleren, Erik
> >> ><Er...@cmegroup.com>
> >> >wrote:
> >> >
> >> >> That is an excellent question!  There are a bunch of ways to monitor
> >> >> jitter and see when that is happening.  Here are a few:
> >> >>
> >> >> - You could slice the histogram every few seconds, save it out with a
> >> >> timestamp, and then look at how they compare.  This would be mostly
> >> >> manual, or you can graph line charts of the percentiles over time in
> >> >>excel
> >> >> where each percentile would be a series.  If you are using HDR
> >> >>Histogram,
> >> >> you should look at how to use the Recorder class to do this coupled
> >> >>with a
> >> >> ScheduledExecutorService.
> >> >>
> >> >> - You can just save the starting timestamp of the event and the
> >>latency
> >> >>of
> >> >> each event.  If you put it into a CSV, you can just load it up into
> >> >>excel
> >> >> and graph as a XY chart.  That way you can see every point during the
> >> >> running of your program and you can see trends.  You want to be
> >>careful
> >> >> about this one, especially of writing to a file in the callback that
> >> >>kfaka
> >> >> provides.
> >> >>
> >> >> Also, I have noticed that most of the very slow observations are at
> >> >> startup.  But don’t trust me, trust the data and share your findings.
> >> >> Also, having a 99.9 percentile provides a pretty good standard for
> >> >>typical
> >> >> poor case performance.  Average is borderline useless, 50%’ile is a
> >> >>better
> >> >> typical case because that’s the number that says “half of events
> >>will be
> >> >> this slow or faster”, or for values that are high like 99.9%’ile,
> >>“0.1%
> >> >>of
> >> >> all events will be slower than this”.
> >> >> -Erik
> >> >>
> >> >> On 9/4/15, 12:05 PM, "Yuheng Du" <yu...@gmail.com> wrote:
> >> >>
> >> >> >Thank you Erik! That's is helpful!
> >> >> >
> >> >> >But also I see jitters of the maximum latencies when running the
> >> >> >experiment.
> >> >> >
> >> >> >The average end to acknowledgement latency from producer to broker
> >>is
> >> >> >around 5ms when using 92 producers and 4 brokers, and the 99.9
> >> >>percentile
> >> >> >latency is 58ms, but the maximum latency goes up to 1359 ms. How to
> >> >>locate
> >> >> >the source of this jitter?
> >> >> >
> >> >> >Thanks.
> >> >> >
> >> >> >On Fri, Sep 4, 2015 at 10:54 AM, Helleren, Erik
> >> >> ><Er...@cmegroup.com>
> >> >> >wrote:
> >> >> >
> >> >> >> WellŠ not to be contrarian, but latency depends much more on the
> >> >>latency
> >> >> >> between the producer and the broker that is the leader for the
> >> >>partition
> >> >> >> you are publishing to.  At least when your brokers are not
> >>saturated
> >> >> >>with
> >> >> >> messages, and acks are set to 1.  If acks are set to ALL, latency
> >>on
> >> >>an
> >> >> >> non-saturated kafka cluster will be: Round Trip Latency from
> >> >>producer to
> >> >> >> leader for partition + Max( slowest Round Trip Latency to a
> >>replicas
> >> >>of
> >> >> >> that partition).  If a cluster is saturated with messages, we
> >>have to
> >> >> >> assume that all partitions receive an equal distribution of
> >>messages
> >> >>to
> >> >> >> avoid linear algebra and queueing theory models.  I don¹t like
> >>linear
> >> >> >> algebra :P
> >> >> >>
> >> >> >> Since you are probably putting all your latencies into a single
> >> >> >>histogram
> >> >> >> per producer, or worse, just an average, this pattern would have
> >>been
> >> >> >> obscured.  Obligatory lecture about measuring latency by Gil Tene
> >> >> >> (https://www.youtube.com/watch?v=9MKY4KypBzg).  To verify this
> >> >> >>hypothesis,
> >> >> >> you should re-write the benchmark to plot the latencies for each
> >> >>write
> >> >> >>to
> >> >> >> a partition for each producer into a histogram. (HRD histogram is
> >> >>pretty
> >> >> >> good for that).  This would give you producers*partitions
> >>histograms,
> >> >> >> which might be unwieldy for that many producers. But wait, there
> >>is
> >> >> >>hope!
> >> >> >>
> >> >> >> To verify that this hypothesis holds, you just have to see that
> >>there
> >> >> >>is a
> >> >> >> significant difference between different partitions on a SINGLE
> >> >> >>producing
> >> >> >> client. So, pick one producing client at random and use the data
> >>from
> >> >> >> that. The easy way to do that is just plot all the partition
> >>latency
> >> >> >> histograms on top of each other in the same plot, that way you
> >>have a
> >> >> >> pretty plot to show people.  If you don¹t want to setup plotting,
> >>you
> >> >> >>can
> >> >> >> just compare the medians (50¹th percentile) of the partitions¹
> >> >> >>histograms.
> >> >> >>  If there is a lot of variance, your latency anomaly is explained
> >>by
> >> >> >> brokers 4-7 being slower than nodes 0-3!  If there isn¹t a lot of
> >> >> >>variance
> >> >> >> at 50%, look at higher percentiles.  And if higher percentiles for
> >> >>all
> >> >> >>the
> >> >> >> partitions look the same, this hypothesis is disproved.
> >> >> >>
> >> >> >> If you want to make a general statement about latency of writing
> >>to
> >> >> >>kafka,
> >> >> >> you can merge all the histograms into a single histogram and plot
> >> >>that.
> >> >> >>
> >> >> >> To Yuheng¹s credit, more brokers always results in more
> >>throughput.
> >> >>But
> >> >> >> throughput and latency are two different creatures.  Its worth
> >>noting
> >> >> >>that
> >> >> >> kafka is designed to be high throughput first and low latency
> >>second.
> >> >> >>And
> >> >> >> it does a really good job at both.
> >> >> >>
> >> >> >> Disclaimer: I might not like linear algebra, but I do like
> >> >>statistics.
> >> >> >> Let me know if there are topics that need more explanation above
> >>that
> >> >> >> aren¹t covered by Gil¹s lecture.
> >> >> >> -Erik
> >> >> >>
> >> >> >> On 9/4/15, 9:03 AM, "Yuheng Du" <yu...@gmail.com> wrote:
> >> >> >>
> >> >> >> >When I using 32 partitions, the 4 brokers latency becomes larger
> >> >>than
> >> >> >>the
> >> >> >> >8
> >> >> >> >brokers latency.
> >> >> >> >
> >> >> >> >So is it always true that using more brokers can give less
> >>latency
> >> >>when
> >> >> >> >the
> >> >> >> >number of partitions is at least the size of the brokers?
> >> >> >> >
> >> >> >> >Thanks.
> >> >> >> >
> >> >> >> >On Thu, Sep 3, 2015 at 10:45 PM, Yuheng Du
> >> >><yu...@gmail.com>
> >> >> >> >wrote:
> >> >> >> >
> >> >> >> >> I am running a producer latency test. When using 92 producers
> >>in
> >> >>92
> >> >> >> >> physical node publishing to 4 brokers, the latency is slightly
> >> >>lower
> >> >> >> >>than
> >> >> >> >> using 8 brokers, I am using 8 partitions for the topic.
> >> >> >> >>
> >> >> >> >> I have rerun the test and it gives me the same result, the 4
> >> >>brokers
> >> >> >> >> scenario still has lower latency than the 8 brokers scenarios.
> >> >> >> >>
> >> >> >> >> It is weird because I tested 1broker, 2 brokers, 4 brokers, 8
> >> >> >>brokers,
> >> >> >> >>16
> >> >> >> >> brokers and 32 brokers. For the rest of the case the latency
> >> >> >>decreases
> >> >> >> >>as
> >> >> >> >> the number of brokers increase.
> >> >> >> >>
> >> >> >> >> 4 brokers/8 brokers is the only pair that doesn't satisfy this
> >> >>rule.
> >> >> >> >>What
> >> >> >> >> could be the cause?
> >> >> >> >>
> >> >> >> >> I am using a 200 bytes message, the test let each producer
> >> >>publishes
> >> >> >> >>500k
> >> >> >> >> messages to a given topic. Every test run when I change the
> >> >>number of
> >> >> >> >> brokers, I use a new topic.
> >> >> >> >>
> >> >> >> >> Thanks for any advices.
> >> >> >> >>
> >> >> >>
> >> >> >>
> >> >>
> >> >>
> >>
> >>
>
>

Re: latency test

Posted by "Helleren, Erik" <Er...@cmegroup.com>.
So, I did my own latency test on a cluster of 3 nodes, and there is a
significant difference around the 99%’ile and higher for partitions when
measuring the the ack time when configured for a single ack.  The graph
that I wish I could attach or post clearly shows that around 1/3 of the
partitions significantly diverge from the other two.  So, at least in my
case, one of my brokers is further than the others.
-Erik

On 9/4/15, 1:06 PM, "Yuheng Du" <yu...@gmail.com> wrote:

>No problem. Thanks for your advice. I think it would be fun to explore. I
>only know how to program in java though. Hope it will work.
>
>On Fri, Sep 4, 2015 at 2:03 PM, Helleren, Erik
><Er...@cmegroup.com>
>wrote:
>
>> I thing the suggestion is to have partitions/brokers >=1, so 32 should
>>be
>> enough.
>>
>> As for latency tests, there isn’t a lot of code to do a latency test.
>>If
>> you just want to measure ack time its around 100 lines.  I will try to
>> push out some good latency testing code to github, but my company is
>> scared of open sourcing code… so it might be a while…
>> -Erik
>>
>>
>> On 9/4/15, 12:55 PM, "Yuheng Du" <yu...@gmail.com> wrote:
>>
>> >Thanks for your reply Erik. I am running some more tests according to
>>your
>> >suggestions now and I will share with my results here. Is it necessary
>>to
>> >use a fixed number of partitions (32 partitions maybe) for my test?
>> >
>> >I am testing 2, 4, 8, 16 and 32 brokers scenarios, all of them are
>>running
>> >on individual physical nodes. So I think using at least 32 partitions
>>will
>> >make more sense? I have seen latencies increase as the number of
>> >partitions
>> >goes up in my experiments.
>> >
>> >To get the latency of each event data recorded, are you suggesting
>>that I
>> >rewrite my own test program (in Java perhaps) or I can just modify the
>> >standard test program provided by kafka (
>> >https://gist.github.com/jkreps/c7ddb4041ef62a900e6c )? I guess I need
>>to
>> >rebuild the source if I modify the standard java test program
>> >ProducerPerformance provided in kafka, right? Now this standard program
>> >only has average latencies and percentile latencies but no per event
>> >latencies.
>> >
>> >Thanks.
>> >
>> >On Fri, Sep 4, 2015 at 1:42 PM, Helleren, Erik
>> ><Er...@cmegroup.com>
>> >wrote:
>> >
>> >> That is an excellent question!  There are a bunch of ways to monitor
>> >> jitter and see when that is happening.  Here are a few:
>> >>
>> >> - You could slice the histogram every few seconds, save it out with a
>> >> timestamp, and then look at how they compare.  This would be mostly
>> >> manual, or you can graph line charts of the percentiles over time in
>> >>excel
>> >> where each percentile would be a series.  If you are using HDR
>> >>Histogram,
>> >> you should look at how to use the Recorder class to do this coupled
>> >>with a
>> >> ScheduledExecutorService.
>> >>
>> >> - You can just save the starting timestamp of the event and the
>>latency
>> >>of
>> >> each event.  If you put it into a CSV, you can just load it up into
>> >>excel
>> >> and graph as a XY chart.  That way you can see every point during the
>> >> running of your program and you can see trends.  You want to be
>>careful
>> >> about this one, especially of writing to a file in the callback that
>> >>kfaka
>> >> provides.
>> >>
>> >> Also, I have noticed that most of the very slow observations are at
>> >> startup.  But don’t trust me, trust the data and share your findings.
>> >> Also, having a 99.9 percentile provides a pretty good standard for
>> >>typical
>> >> poor case performance.  Average is borderline useless, 50%’ile is a
>> >>better
>> >> typical case because that’s the number that says “half of events
>>will be
>> >> this slow or faster”, or for values that are high like 99.9%’ile,
>>“0.1%
>> >>of
>> >> all events will be slower than this”.
>> >> -Erik
>> >>
>> >> On 9/4/15, 12:05 PM, "Yuheng Du" <yu...@gmail.com> wrote:
>> >>
>> >> >Thank you Erik! That's is helpful!
>> >> >
>> >> >But also I see jitters of the maximum latencies when running the
>> >> >experiment.
>> >> >
>> >> >The average end to acknowledgement latency from producer to broker
>>is
>> >> >around 5ms when using 92 producers and 4 brokers, and the 99.9
>> >>percentile
>> >> >latency is 58ms, but the maximum latency goes up to 1359 ms. How to
>> >>locate
>> >> >the source of this jitter?
>> >> >
>> >> >Thanks.
>> >> >
>> >> >On Fri, Sep 4, 2015 at 10:54 AM, Helleren, Erik
>> >> ><Er...@cmegroup.com>
>> >> >wrote:
>> >> >
>> >> >> WellŠ not to be contrarian, but latency depends much more on the
>> >>latency
>> >> >> between the producer and the broker that is the leader for the
>> >>partition
>> >> >> you are publishing to.  At least when your brokers are not
>>saturated
>> >> >>with
>> >> >> messages, and acks are set to 1.  If acks are set to ALL, latency
>>on
>> >>an
>> >> >> non-saturated kafka cluster will be: Round Trip Latency from
>> >>producer to
>> >> >> leader for partition + Max( slowest Round Trip Latency to a
>>replicas
>> >>of
>> >> >> that partition).  If a cluster is saturated with messages, we
>>have to
>> >> >> assume that all partitions receive an equal distribution of
>>messages
>> >>to
>> >> >> avoid linear algebra and queueing theory models.  I don¹t like
>>linear
>> >> >> algebra :P
>> >> >>
>> >> >> Since you are probably putting all your latencies into a single
>> >> >>histogram
>> >> >> per producer, or worse, just an average, this pattern would have
>>been
>> >> >> obscured.  Obligatory lecture about measuring latency by Gil Tene
>> >> >> (https://www.youtube.com/watch?v=9MKY4KypBzg).  To verify this
>> >> >>hypothesis,
>> >> >> you should re-write the benchmark to plot the latencies for each
>> >>write
>> >> >>to
>> >> >> a partition for each producer into a histogram. (HRD histogram is
>> >>pretty
>> >> >> good for that).  This would give you producers*partitions
>>histograms,
>> >> >> which might be unwieldy for that many producers. But wait, there
>>is
>> >> >>hope!
>> >> >>
>> >> >> To verify that this hypothesis holds, you just have to see that
>>there
>> >> >>is a
>> >> >> significant difference between different partitions on a SINGLE
>> >> >>producing
>> >> >> client. So, pick one producing client at random and use the data
>>from
>> >> >> that. The easy way to do that is just plot all the partition
>>latency
>> >> >> histograms on top of each other in the same plot, that way you
>>have a
>> >> >> pretty plot to show people.  If you don¹t want to setup plotting,
>>you
>> >> >>can
>> >> >> just compare the medians (50¹th percentile) of the partitions¹
>> >> >>histograms.
>> >> >>  If there is a lot of variance, your latency anomaly is explained
>>by
>> >> >> brokers 4-7 being slower than nodes 0-3!  If there isn¹t a lot of
>> >> >>variance
>> >> >> at 50%, look at higher percentiles.  And if higher percentiles for
>> >>all
>> >> >>the
>> >> >> partitions look the same, this hypothesis is disproved.
>> >> >>
>> >> >> If you want to make a general statement about latency of writing
>>to
>> >> >>kafka,
>> >> >> you can merge all the histograms into a single histogram and plot
>> >>that.
>> >> >>
>> >> >> To Yuheng¹s credit, more brokers always results in more
>>throughput.
>> >>But
>> >> >> throughput and latency are two different creatures.  Its worth
>>noting
>> >> >>that
>> >> >> kafka is designed to be high throughput first and low latency
>>second.
>> >> >>And
>> >> >> it does a really good job at both.
>> >> >>
>> >> >> Disclaimer: I might not like linear algebra, but I do like
>> >>statistics.
>> >> >> Let me know if there are topics that need more explanation above
>>that
>> >> >> aren¹t covered by Gil¹s lecture.
>> >> >> -Erik
>> >> >>
>> >> >> On 9/4/15, 9:03 AM, "Yuheng Du" <yu...@gmail.com> wrote:
>> >> >>
>> >> >> >When I using 32 partitions, the 4 brokers latency becomes larger
>> >>than
>> >> >>the
>> >> >> >8
>> >> >> >brokers latency.
>> >> >> >
>> >> >> >So is it always true that using more brokers can give less
>>latency
>> >>when
>> >> >> >the
>> >> >> >number of partitions is at least the size of the brokers?
>> >> >> >
>> >> >> >Thanks.
>> >> >> >
>> >> >> >On Thu, Sep 3, 2015 at 10:45 PM, Yuheng Du
>> >><yu...@gmail.com>
>> >> >> >wrote:
>> >> >> >
>> >> >> >> I am running a producer latency test. When using 92 producers
>>in
>> >>92
>> >> >> >> physical node publishing to 4 brokers, the latency is slightly
>> >>lower
>> >> >> >>than
>> >> >> >> using 8 brokers, I am using 8 partitions for the topic.
>> >> >> >>
>> >> >> >> I have rerun the test and it gives me the same result, the 4
>> >>brokers
>> >> >> >> scenario still has lower latency than the 8 brokers scenarios.
>> >> >> >>
>> >> >> >> It is weird because I tested 1broker, 2 brokers, 4 brokers, 8
>> >> >>brokers,
>> >> >> >>16
>> >> >> >> brokers and 32 brokers. For the rest of the case the latency
>> >> >>decreases
>> >> >> >>as
>> >> >> >> the number of brokers increase.
>> >> >> >>
>> >> >> >> 4 brokers/8 brokers is the only pair that doesn't satisfy this
>> >>rule.
>> >> >> >>What
>> >> >> >> could be the cause?
>> >> >> >>
>> >> >> >> I am using a 200 bytes message, the test let each producer
>> >>publishes
>> >> >> >>500k
>> >> >> >> messages to a given topic. Every test run when I change the
>> >>number of
>> >> >> >> brokers, I use a new topic.
>> >> >> >>
>> >> >> >> Thanks for any advices.
>> >> >> >>
>> >> >>
>> >> >>
>> >>
>> >>
>>
>>


Re: latency test

Posted by Yuheng Du <yu...@gmail.com>.
No problem. Thanks for your advice. I think it would be fun to explore. I
only know how to program in java though. Hope it will work.

On Fri, Sep 4, 2015 at 2:03 PM, Helleren, Erik <Er...@cmegroup.com>
wrote:

> I thing the suggestion is to have partitions/brokers >=1, so 32 should be
> enough.
>
> As for latency tests, there isn’t a lot of code to do a latency test.  If
> you just want to measure ack time its around 100 lines.  I will try to
> push out some good latency testing code to github, but my company is
> scared of open sourcing code… so it might be a while…
> -Erik
>
>
> On 9/4/15, 12:55 PM, "Yuheng Du" <yu...@gmail.com> wrote:
>
> >Thanks for your reply Erik. I am running some more tests according to your
> >suggestions now and I will share with my results here. Is it necessary to
> >use a fixed number of partitions (32 partitions maybe) for my test?
> >
> >I am testing 2, 4, 8, 16 and 32 brokers scenarios, all of them are running
> >on individual physical nodes. So I think using at least 32 partitions will
> >make more sense? I have seen latencies increase as the number of
> >partitions
> >goes up in my experiments.
> >
> >To get the latency of each event data recorded, are you suggesting that I
> >rewrite my own test program (in Java perhaps) or I can just modify the
> >standard test program provided by kafka (
> >https://gist.github.com/jkreps/c7ddb4041ef62a900e6c )? I guess I need to
> >rebuild the source if I modify the standard java test program
> >ProducerPerformance provided in kafka, right? Now this standard program
> >only has average latencies and percentile latencies but no per event
> >latencies.
> >
> >Thanks.
> >
> >On Fri, Sep 4, 2015 at 1:42 PM, Helleren, Erik
> ><Er...@cmegroup.com>
> >wrote:
> >
> >> That is an excellent question!  There are a bunch of ways to monitor
> >> jitter and see when that is happening.  Here are a few:
> >>
> >> - You could slice the histogram every few seconds, save it out with a
> >> timestamp, and then look at how they compare.  This would be mostly
> >> manual, or you can graph line charts of the percentiles over time in
> >>excel
> >> where each percentile would be a series.  If you are using HDR
> >>Histogram,
> >> you should look at how to use the Recorder class to do this coupled
> >>with a
> >> ScheduledExecutorService.
> >>
> >> - You can just save the starting timestamp of the event and the latency
> >>of
> >> each event.  If you put it into a CSV, you can just load it up into
> >>excel
> >> and graph as a XY chart.  That way you can see every point during the
> >> running of your program and you can see trends.  You want to be careful
> >> about this one, especially of writing to a file in the callback that
> >>kfaka
> >> provides.
> >>
> >> Also, I have noticed that most of the very slow observations are at
> >> startup.  But don’t trust me, trust the data and share your findings.
> >> Also, having a 99.9 percentile provides a pretty good standard for
> >>typical
> >> poor case performance.  Average is borderline useless, 50%’ile is a
> >>better
> >> typical case because that’s the number that says “half of events will be
> >> this slow or faster”, or for values that are high like 99.9%’ile, “0.1%
> >>of
> >> all events will be slower than this”.
> >> -Erik
> >>
> >> On 9/4/15, 12:05 PM, "Yuheng Du" <yu...@gmail.com> wrote:
> >>
> >> >Thank you Erik! That's is helpful!
> >> >
> >> >But also I see jitters of the maximum latencies when running the
> >> >experiment.
> >> >
> >> >The average end to acknowledgement latency from producer to broker is
> >> >around 5ms when using 92 producers and 4 brokers, and the 99.9
> >>percentile
> >> >latency is 58ms, but the maximum latency goes up to 1359 ms. How to
> >>locate
> >> >the source of this jitter?
> >> >
> >> >Thanks.
> >> >
> >> >On Fri, Sep 4, 2015 at 10:54 AM, Helleren, Erik
> >> ><Er...@cmegroup.com>
> >> >wrote:
> >> >
> >> >> WellŠ not to be contrarian, but latency depends much more on the
> >>latency
> >> >> between the producer and the broker that is the leader for the
> >>partition
> >> >> you are publishing to.  At least when your brokers are not saturated
> >> >>with
> >> >> messages, and acks are set to 1.  If acks are set to ALL, latency on
> >>an
> >> >> non-saturated kafka cluster will be: Round Trip Latency from
> >>producer to
> >> >> leader for partition + Max( slowest Round Trip Latency to a replicas
> >>of
> >> >> that partition).  If a cluster is saturated with messages, we have to
> >> >> assume that all partitions receive an equal distribution of messages
> >>to
> >> >> avoid linear algebra and queueing theory models.  I don¹t like linear
> >> >> algebra :P
> >> >>
> >> >> Since you are probably putting all your latencies into a single
> >> >>histogram
> >> >> per producer, or worse, just an average, this pattern would have been
> >> >> obscured.  Obligatory lecture about measuring latency by Gil Tene
> >> >> (https://www.youtube.com/watch?v=9MKY4KypBzg).  To verify this
> >> >>hypothesis,
> >> >> you should re-write the benchmark to plot the latencies for each
> >>write
> >> >>to
> >> >> a partition for each producer into a histogram. (HRD histogram is
> >>pretty
> >> >> good for that).  This would give you producers*partitions histograms,
> >> >> which might be unwieldy for that many producers. But wait, there is
> >> >>hope!
> >> >>
> >> >> To verify that this hypothesis holds, you just have to see that there
> >> >>is a
> >> >> significant difference between different partitions on a SINGLE
> >> >>producing
> >> >> client. So, pick one producing client at random and use the data from
> >> >> that. The easy way to do that is just plot all the partition latency
> >> >> histograms on top of each other in the same plot, that way you have a
> >> >> pretty plot to show people.  If you don¹t want to setup plotting, you
> >> >>can
> >> >> just compare the medians (50¹th percentile) of the partitions¹
> >> >>histograms.
> >> >>  If there is a lot of variance, your latency anomaly is explained by
> >> >> brokers 4-7 being slower than nodes 0-3!  If there isn¹t a lot of
> >> >>variance
> >> >> at 50%, look at higher percentiles.  And if higher percentiles for
> >>all
> >> >>the
> >> >> partitions look the same, this hypothesis is disproved.
> >> >>
> >> >> If you want to make a general statement about latency of writing to
> >> >>kafka,
> >> >> you can merge all the histograms into a single histogram and plot
> >>that.
> >> >>
> >> >> To Yuheng¹s credit, more brokers always results in more throughput.
> >>But
> >> >> throughput and latency are two different creatures.  Its worth noting
> >> >>that
> >> >> kafka is designed to be high throughput first and low latency second.
> >> >>And
> >> >> it does a really good job at both.
> >> >>
> >> >> Disclaimer: I might not like linear algebra, but I do like
> >>statistics.
> >> >> Let me know if there are topics that need more explanation above that
> >> >> aren¹t covered by Gil¹s lecture.
> >> >> -Erik
> >> >>
> >> >> On 9/4/15, 9:03 AM, "Yuheng Du" <yu...@gmail.com> wrote:
> >> >>
> >> >> >When I using 32 partitions, the 4 brokers latency becomes larger
> >>than
> >> >>the
> >> >> >8
> >> >> >brokers latency.
> >> >> >
> >> >> >So is it always true that using more brokers can give less latency
> >>when
> >> >> >the
> >> >> >number of partitions is at least the size of the brokers?
> >> >> >
> >> >> >Thanks.
> >> >> >
> >> >> >On Thu, Sep 3, 2015 at 10:45 PM, Yuheng Du
> >><yu...@gmail.com>
> >> >> >wrote:
> >> >> >
> >> >> >> I am running a producer latency test. When using 92 producers in
> >>92
> >> >> >> physical node publishing to 4 brokers, the latency is slightly
> >>lower
> >> >> >>than
> >> >> >> using 8 brokers, I am using 8 partitions for the topic.
> >> >> >>
> >> >> >> I have rerun the test and it gives me the same result, the 4
> >>brokers
> >> >> >> scenario still has lower latency than the 8 brokers scenarios.
> >> >> >>
> >> >> >> It is weird because I tested 1broker, 2 brokers, 4 brokers, 8
> >> >>brokers,
> >> >> >>16
> >> >> >> brokers and 32 brokers. For the rest of the case the latency
> >> >>decreases
> >> >> >>as
> >> >> >> the number of brokers increase.
> >> >> >>
> >> >> >> 4 brokers/8 brokers is the only pair that doesn't satisfy this
> >>rule.
> >> >> >>What
> >> >> >> could be the cause?
> >> >> >>
> >> >> >> I am using a 200 bytes message, the test let each producer
> >>publishes
> >> >> >>500k
> >> >> >> messages to a given topic. Every test run when I change the
> >>number of
> >> >> >> brokers, I use a new topic.
> >> >> >>
> >> >> >> Thanks for any advices.
> >> >> >>
> >> >>
> >> >>
> >>
> >>
>
>

Re: latency test

Posted by "Helleren, Erik" <Er...@cmegroup.com>.
I thing the suggestion is to have partitions/brokers >=1, so 32 should be
enough.  

As for latency tests, there isn’t a lot of code to do a latency test.  If
you just want to measure ack time its around 100 lines.  I will try to
push out some good latency testing code to github, but my company is
scared of open sourcing code… so it might be a while…
-Erik


On 9/4/15, 12:55 PM, "Yuheng Du" <yu...@gmail.com> wrote:

>Thanks for your reply Erik. I am running some more tests according to your
>suggestions now and I will share with my results here. Is it necessary to
>use a fixed number of partitions (32 partitions maybe) for my test?
>
>I am testing 2, 4, 8, 16 and 32 brokers scenarios, all of them are running
>on individual physical nodes. So I think using at least 32 partitions will
>make more sense? I have seen latencies increase as the number of
>partitions
>goes up in my experiments.
>
>To get the latency of each event data recorded, are you suggesting that I
>rewrite my own test program (in Java perhaps) or I can just modify the
>standard test program provided by kafka (
>https://gist.github.com/jkreps/c7ddb4041ef62a900e6c )? I guess I need to
>rebuild the source if I modify the standard java test program
>ProducerPerformance provided in kafka, right? Now this standard program
>only has average latencies and percentile latencies but no per event
>latencies.
>
>Thanks.
>
>On Fri, Sep 4, 2015 at 1:42 PM, Helleren, Erik
><Er...@cmegroup.com>
>wrote:
>
>> That is an excellent question!  There are a bunch of ways to monitor
>> jitter and see when that is happening.  Here are a few:
>>
>> - You could slice the histogram every few seconds, save it out with a
>> timestamp, and then look at how they compare.  This would be mostly
>> manual, or you can graph line charts of the percentiles over time in
>>excel
>> where each percentile would be a series.  If you are using HDR
>>Histogram,
>> you should look at how to use the Recorder class to do this coupled
>>with a
>> ScheduledExecutorService.
>>
>> - You can just save the starting timestamp of the event and the latency
>>of
>> each event.  If you put it into a CSV, you can just load it up into
>>excel
>> and graph as a XY chart.  That way you can see every point during the
>> running of your program and you can see trends.  You want to be careful
>> >> about this one, especially about writing to a file in the callback
>> >> that kafka provides.
>>
>> Also, I have noticed that most of the very slow observations are at
>> startup.  But don’t trust me, trust the data and share your findings.
>> Also, having a 99.9 percentile provides a pretty good standard for
>>typical
>> poor case performance.  Average is borderline useless, 50%’ile is a
>>better
>> typical case because that’s the number that says “half of events will be
>> this slow or faster”, or for values that are high like 99.9%’ile, “0.1%
>>of
>> all events will be slower than this”.
>> -Erik
>>
>> On 9/4/15, 12:05 PM, "Yuheng Du" <yu...@gmail.com> wrote:
>>
>> >Thank you Erik! That is helpful!
>> >
>> >But I also see jitter in the maximum latencies when running the
>> >experiment.
>> >
>> >The average end-to-acknowledgement latency from producer to broker is
>> >around 5ms when using 92 producers and 4 brokers, and the 99.9
>> >percentile latency is 58ms, but the maximum latency goes up to 1359 ms.
>> >How can I locate the source of this jitter?
>> >
>> >Thanks.
>> >
>> >On Fri, Sep 4, 2015 at 10:54 AM, Helleren, Erik
>> ><Er...@cmegroup.com>
>> >wrote:
>> >
>> >> Well… not to be contrarian, but latency depends much more on the
>> >> latency between the producer and the broker that is the leader for
>> >> the partition you are publishing to.  At least when your brokers are
>> >> not saturated with messages, and acks are set to 1.  If acks are set
>> >> to ALL, latency on a non-saturated kafka cluster will be: Round Trip
>> >> Latency from producer to leader for partition + Max(slowest Round
>> >> Trip Latency to a replica of that partition).  If a cluster is
>> >> saturated with messages, we have to assume that all partitions
>> >> receive an equal distribution of messages to avoid linear algebra and
>> >> queueing theory models.  I don’t like linear algebra :P
>> >>
>> >> Since you are probably putting all your latencies into a single
>> >> histogram per producer, or worse, just an average, this pattern would
>> >> have been obscured.  Obligatory lecture about measuring latency by
>> >> Gil Tene (https://www.youtube.com/watch?v=9MKY4KypBzg).  To verify
>> >> this hypothesis, you should re-write the benchmark to plot the
>> >> latencies for each write to a partition for each producer into a
>> >> histogram. (HDR Histogram is pretty good for that).  This would give
>> >> you producers*partitions histograms, which might be unwieldy for that
>> >> many producers. But wait, there is hope!
>> >>
>> >> To verify that this hypothesis holds, you just have to see that there
>> >> is a significant difference between different partitions on a SINGLE
>> >> producing client. So, pick one producing client at random and use the
>> >> data from that. The easy way to do that is just plot all the
>> >> partition latency histograms on top of each other in the same plot,
>> >> that way you have a pretty plot to show people.  If you don’t want to
>> >> set up plotting, you can just compare the medians (50th percentile)
>> >> of the partitions’ histograms.  If there is a lot of variance, your
>> >> latency anomaly is explained by brokers 4-7 being slower than nodes
>> >> 0-3!  If there isn’t a lot of variance at 50%, look at higher
>> >> percentiles.  And if higher percentiles for all the partitions look
>> >> the same, this hypothesis is disproved.
>> >>
>> >> If you want to make a general statement about latency of writing to
>> >> kafka, you can merge all the histograms into a single histogram and
>> >> plot that.
>> >>
>> >> To Yuheng’s credit, more brokers always result in more throughput.
>> >> But throughput and latency are two different creatures.  It’s worth
>> >> noting that kafka is designed to be high throughput first and low
>> >> latency second.  And it does a really good job at both.
>> >>
>> >> Disclaimer: I might not like linear algebra, but I do like
>> >> statistics.  Let me know if there are topics that need more
>> >> explanation above that aren’t covered by Gil’s lecture.
>> >> -Erik
>> >>
>> >> On 9/4/15, 9:03 AM, "Yuheng Du" <yu...@gmail.com> wrote:
>> >>
>> >> >When I use 32 partitions, the 4-broker latency becomes larger than
>> >> >the 8-broker latency.
>> >> >
>> >> >So is it always true that using more brokers gives lower latency
>> >> >when the number of partitions is at least the number of brokers?
>> >> >
>> >> >Thanks.
>> >> >
>> >> >On Thu, Sep 3, 2015 at 10:45 PM, Yuheng Du
>><yu...@gmail.com>
>> >> >wrote:
>> >> >
>> >> >> I am running a producer latency test. When using 92 producers on
>> >> >> 92 physical nodes publishing to 4 brokers, the latency is slightly
>> >> >> lower than when using 8 brokers. I am using 8 partitions for the
>> >> >> topic.
>> >> >>
>> >> >> I have rerun the test and it gives me the same result: the
>> >> >> 4-broker scenario still has lower latency than the 8-broker
>> >> >> scenario.
>> >> >>
>> >> >> It is weird because I tested 1 broker, 2 brokers, 4 brokers, 8
>> >> >> brokers, 16 brokers and 32 brokers. In all other cases the latency
>> >> >> decreases as the number of brokers increases.
>> >> >>
>> >> >> 4 brokers/8 brokers is the only pair that doesn't follow this
>> >> >> rule. What could be the cause?
>> >> >>
>> >> >> I am using 200-byte messages, and the test has each producer
>> >> >> publish 500k messages to a given topic. Every time I change the
>> >> >> number of brokers I use a new topic for the test run.
>> >> >>
>> >> >> Thanks for any advice.
>> >> >>
>> >>
>> >>
>>
>>


Re: latency test

Posted by Yuheng Du <yu...@gmail.com>.
Thanks for your reply Erik. I am running some more tests according to your
suggestions now and I will share my results here. Is it necessary to use a
fixed number of partitions (32 partitions maybe) for my test?

I am testing 2, 4, 8, 16 and 32 brokers scenarios, all of them running on
individual physical nodes. So I think using at least 32 partitions will
make more sense? I have seen latencies increase as the number of partitions
goes up in my experiments.

To get the latency of each event recorded, are you suggesting that I
rewrite my own test program (in Java perhaps), or can I just modify the
standard test program provided by kafka (
https://gist.github.com/jkreps/c7ddb4041ef62a900e6c )? I guess I need to
rebuild the source if I modify the standard java test program
ProducerPerformance provided in kafka, right? Now this standard program
only has average latencies and percentile latencies but no per-event
latencies.

Thanks.
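
For illustration, one way per-event latencies could be captured (a hedged
sketch, not the actual ProducerPerformance code; the helper class below is
invented for this example): buffer each event's send timestamp and ack
latency in memory via the send callback, and write the CSV only after the
run, in line with Erik's warning about file writes inside the callback.

  import java.io.PrintWriter;
  import java.util.concurrent.atomic.AtomicInteger;
  import org.apache.kafka.clients.producer.Callback;

  // Hypothetical helper, not part of kafka's ProducerPerformance.
  public class PerEventLatencyLog {
      static final int N = 500_000;                 // messages per producer
      static final long[] sendMs = new long[N];     // wall-clock send time
      static final long[] latencyUs = new long[N];  // send-to-ack latency
      static final AtomicInteger idx = new AtomicInteger();

      // Pass the result as the second argument of producer.send(...).
      static Callback record(long startNanos, long wallMs) {
          return (metadata, exception) -> {
              if (exception == null) {
                  int i = idx.getAndIncrement();
                  sendMs[i] = wallMs;
                  latencyUs[i] = (System.nanoTime() - startNanos) / 1000;
              }
          };
      }

      // Call once after producer.close(); the CSV loads straight into excel.
      static void dump(String path) throws Exception {
          try (PrintWriter out = new PrintWriter(path)) {
              out.println("send_time_ms,latency_us");
              for (int i = 0; i < idx.get(); i++) {
                  out.println(sendMs[i] + "," + latencyUs[i]);
              }
          }
      }
  }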

On Fri, Sep 4, 2015 at 1:42 PM, Helleren, Erik <Er...@cmegroup.com>
wrote:

> That is an excellent question!  There are a bunch of ways to monitor
> jitter and see when that is happening.  Here are a few:
>
> - You could slice the histogram every few seconds, save it out with a
> timestamp, and then look at how they compare.  This would be mostly
> manual, or you can graph line charts of the percentiles over time in excel
> where each percentile would be a series.  If you are using HDR Histogram,
> you should look at how to use the Recorder class to do this coupled with a
> ScheduledExecutorService.
>
> - You can just save the starting timestamp of the event and the latency of
> each event.  If you put it into a CSV, you can just load it up into excel
> and graph as a XY chart.  That way you can see every point during the
> running of your program and you can see trends.  You want to be careful
> about this one, especially about writing to a file in the callback that
> kafka provides.
>
> Also, I have noticed that most of the very slow observations are at
> startup.  But don’t trust me, trust the data and share your findings.
> Also, having a 99.9 percentile provides a pretty good standard for typical
> poor case performance.  Average is borderline useless, 50%’ile is a better
> typical case because that’s the number that says “half of events will be
> this slow or faster”, or for values that are high like 99.9%’ile, “0.1% of
> all events will be slower than this”.
> -Erik
>
> On 9/4/15, 12:05 PM, "Yuheng Du" <yu...@gmail.com> wrote:
>
> >Thank you Erik! That is helpful!
> >
> >But I also see jitter in the maximum latencies when running the
> >experiment.
> >
> >The average end-to-acknowledgement latency from producer to broker is
> >around 5ms when using 92 producers and 4 brokers, and the 99.9 percentile
> >latency is 58ms, but the maximum latency goes up to 1359 ms. How can I
> >locate the source of this jitter?
> >
> >Thanks.
> >
> >On Fri, Sep 4, 2015 at 10:54 AM, Helleren, Erik
> ><Er...@cmegroup.com>
> >wrote:
> >
> >> Well… not to be contrarian, but latency depends much more on the
> >> latency between the producer and the broker that is the leader for the
> >> partition you are publishing to.  At least when your brokers are not
> >> saturated with messages, and acks are set to 1.  If acks are set to
> >> ALL, latency on a non-saturated kafka cluster will be: Round Trip
> >> Latency from producer to leader for partition + Max(slowest Round Trip
> >> Latency to a replica of that partition).  If a cluster is saturated
> >> with messages, we have to assume that all partitions receive an equal
> >> distribution of messages to avoid linear algebra and queueing theory
> >> models.  I don’t like linear algebra :P
> >>
> >> Since you are probably putting all your latencies into a single
> >> histogram per producer, or worse, just an average, this pattern would
> >> have been obscured.  Obligatory lecture about measuring latency by Gil
> >> Tene (https://www.youtube.com/watch?v=9MKY4KypBzg).  To verify this
> >> hypothesis, you should re-write the benchmark to plot the latencies
> >> for each write to a partition for each producer into a histogram. (HDR
> >> Histogram is pretty good for that).  This would give you
> >> producers*partitions histograms, which might be unwieldy for that many
> >> producers. But wait, there is hope!
> >>
> >> To verify that this hypothesis holds, you just have to see that there
> >> is a significant difference between different partitions on a SINGLE
> >> producing client. So, pick one producing client at random and use the
> >> data from that. The easy way to do that is just plot all the partition
> >> latency histograms on top of each other in the same plot, that way you
> >> have a pretty plot to show people.  If you don’t want to set up
> >> plotting, you can just compare the medians (50th percentile) of the
> >> partitions’ histograms.  If there is a lot of variance, your latency
> >> anomaly is explained by brokers 4-7 being slower than nodes 0-3!  If
> >> there isn’t a lot of variance at 50%, look at higher percentiles.  And
> >> if higher percentiles for all the partitions look the same, this
> >> hypothesis is disproved.
> >>
> >> If you want to make a general statement about latency of writing to
> >> kafka, you can merge all the histograms into a single histogram and
> >> plot that.
> >>
> >> To Yuheng’s credit, more brokers always result in more throughput. But
> >> throughput and latency are two different creatures.  It’s worth noting
> >> that kafka is designed to be high throughput first and low latency
> >> second.  And it does a really good job at both.
> >>
> >> Disclaimer: I might not like linear algebra, but I do like statistics.
> >> Let me know if there are topics that need more explanation above that
> >> aren’t covered by Gil’s lecture.
> >> -Erik
> >>
> >> On 9/4/15, 9:03 AM, "Yuheng Du" <yu...@gmail.com> wrote:
> >>
> >> >When I use 32 partitions, the 4-broker latency becomes larger than
> >> >the 8-broker latency.
> >> >
> >> >So is it always true that using more brokers gives lower latency when
> >> >the number of partitions is at least the number of brokers?
> >> >
> >> >Thanks.
> >> >
> >> >On Thu, Sep 3, 2015 at 10:45 PM, Yuheng Du <yu...@gmail.com>
> >> >wrote:
> >> >
> >> >> I am running a producer latency test. When using 92 producers on 92
> >> >> physical nodes publishing to 4 brokers, the latency is slightly
> >> >> lower than when using 8 brokers. I am using 8 partitions for the
> >> >> topic.
> >> >>
> >> >> I have rerun the test and it gives me the same result: the 4-broker
> >> >> scenario still has lower latency than the 8-broker scenario.
> >> >>
> >> >> It is weird because I tested 1 broker, 2 brokers, 4 brokers, 8
> >> >> brokers, 16 brokers and 32 brokers. In all other cases the latency
> >> >> decreases as the number of brokers increases.
> >> >>
> >> >> 4 brokers/8 brokers is the only pair that doesn't follow this rule.
> >> >> What could be the cause?
> >> >>
> >> >> I am using 200-byte messages, and the test has each producer publish
> >> >> 500k messages to a given topic. Every time I change the number of
> >> >> brokers I use a new topic for the test run.
> >> >>
> >> >> Thanks for any advice.
> >> >>
> >>
> >>
>
>

Re: latency test

Posted by "Helleren, Erik" <Er...@cmegroup.com>.
That is an excellent question!  There are a bunch of ways to monitor
jitter and see when that is happening.  Here are a few:

- You could slice the histogram every few seconds, save each slice out
with a timestamp, and then look at how the slices compare.  This would be
mostly manual, or you can graph line charts of the percentiles over time
in excel where each percentile would be a series.  If you are using HDR
Histogram, you should look at how to use the Recorder class to do this
coupled with a ScheduledExecutorService (a rough sketch follows after
this list).

- You can just save the starting timestamp of the event and the latency of
each event.  If you put it into a CSV, you can just load it up into excel
and graph as a XY chart.  That way you can see every point during the
running of your program and you can see trends.  You want to be careful
about this one, especially about writing to a file in the callback that
kafka provides.
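
For illustration, a rough sketch of the Recorder-plus-scheduler slicing
from the first bullet (the 5-second interval and the CSV-style output
format are arbitrary choices, not from this thread):

  import java.util.concurrent.Executors;
  import java.util.concurrent.ScheduledExecutorService;
  import java.util.concurrent.TimeUnit;
  import org.HdrHistogram.Histogram;
  import org.HdrHistogram.Recorder;

  public class SlicedLatency {
      // Recorder is safe to write from the producer callback thread.
      static final Recorder recorder = new Recorder(3);

      public static void main(String[] args) {
          ScheduledExecutorService ses =
              Executors.newSingleThreadScheduledExecutor();
          // Every 5 seconds, swap out the interval histogram and print a
          // timestamped slice; the lines graph as percentile series over
          // time.
          ses.scheduleAtFixedRate(() -> {
              Histogram slice = recorder.getIntervalHistogram();
              System.out.printf("%d,%d,%d,%d%n",
                  System.currentTimeMillis(),
                  slice.getValueAtPercentile(50.0),
                  slice.getValueAtPercentile(99.9),
                  slice.getMaxValue());
          }, 5, 5, TimeUnit.SECONDS);

          // In the producer callback, record each latency (microseconds):
          // recorder.recordValue(latencyUs);
      }
  }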

Also, I have noticed that most of the very slow observations are at
startup.  But don’t trust me, trust the data and share your findings.
Also, having a 99.9 percentile provides a pretty good standard for typical
poor case performance.  Average is borderline useless, 50%’ile is a better
typical case because that’s the number that says “half of events will be
this slow or faster”, or for values that are high like 99.9%’ile, “0.1% of
all events will be slower than this”.
-Erik 

On 9/4/15, 12:05 PM, "Yuheng Du" <yu...@gmail.com> wrote:

>Thank you Erik! That is helpful!
>
>But I also see jitter in the maximum latencies when running the
>experiment.
>
>The average end-to-acknowledgement latency from producer to broker is
>around 5ms when using 92 producers and 4 brokers, and the 99.9 percentile
>latency is 58ms, but the maximum latency goes up to 1359 ms. How can I
>locate the source of this jitter?
>
>Thanks.
>
>On Fri, Sep 4, 2015 at 10:54 AM, Helleren, Erik
><Er...@cmegroup.com>
>wrote:
>
>> Well… not to be contrarian, but latency depends much more on the latency
>> between the producer and the broker that is the leader for the partition
>> you are publishing to.  At least when your brokers are not saturated
>> with messages, and acks are set to 1.  If acks are set to ALL, latency
>> on a non-saturated kafka cluster will be: Round Trip Latency from
>> producer to leader for partition + Max(slowest Round Trip Latency to a
>> replica of that partition).  If a cluster is saturated with messages, we
>> have to assume that all partitions receive an equal distribution of
>> messages to avoid linear algebra and queueing theory models.  I don’t
>> like linear algebra :P
>>
>> Since you are probably putting all your latencies into a single
>> histogram per producer, or worse, just an average, this pattern would
>> have been obscured.  Obligatory lecture about measuring latency by Gil
>> Tene (https://www.youtube.com/watch?v=9MKY4KypBzg).  To verify this
>> hypothesis, you should re-write the benchmark to plot the latencies for
>> each write to a partition for each producer into a histogram. (HDR
>> Histogram is pretty good for that).  This would give you
>> producers*partitions histograms, which might be unwieldy for that many
>> producers. But wait, there is hope!
>>
>> To verify that this hypothesis holds, you just have to see that there is
>> a significant difference between different partitions on a SINGLE
>> producing client. So, pick one producing client at random and use the
>> data from that. The easy way to do that is just plot all the partition
>> latency histograms on top of each other in the same plot, that way you
>> have a pretty plot to show people.  If you don’t want to set up
>> plotting, you can just compare the medians (50th percentile) of the
>> partitions’ histograms.  If there is a lot of variance, your latency
>> anomaly is explained by brokers 4-7 being slower than nodes 0-3!  If
>> there isn’t a lot of variance at 50%, look at higher percentiles.  And
>> if higher percentiles for all the partitions look the same, this
>> hypothesis is disproved.
>>
>> If you want to make a general statement about latency of writing to
>> kafka, you can merge all the histograms into a single histogram and plot
>> that.
>>
>> To Yuheng’s credit, more brokers always result in more throughput. But
>> throughput and latency are two different creatures.  It’s worth noting
>> that kafka is designed to be high throughput first and low latency
>> second.  And it does a really good job at both.
>>
>> Disclaimer: I might not like linear algebra, but I do like statistics.
>> Let me know if there are topics that need more explanation above that
>> aren’t covered by Gil’s lecture.
>> -Erik
>>
>> On 9/4/15, 9:03 AM, "Yuheng Du" <yu...@gmail.com> wrote:
>>
>> >When I use 32 partitions, the 4-broker latency becomes larger than the
>> >8-broker latency.
>> >
>> >So is it always true that using more brokers gives lower latency when
>> >the number of partitions is at least the number of brokers?
>> >
>> >Thanks.
>> >
>> >On Thu, Sep 3, 2015 at 10:45 PM, Yuheng Du <yu...@gmail.com>
>> >wrote:
>> >
>> >> I am running a producer latency test. When using 92 producers on 92
>> >> physical nodes publishing to 4 brokers, the latency is slightly lower
>> >> than when using 8 brokers. I am using 8 partitions for the topic.
>> >>
>> >> I have rerun the test and it gives me the same result: the 4-broker
>> >> scenario still has lower latency than the 8-broker scenario.
>> >>
>> >> It is weird because I tested 1 broker, 2 brokers, 4 brokers, 8
>> >> brokers, 16 brokers and 32 brokers. In all other cases the latency
>> >> decreases as the number of brokers increases.
>> >>
>> >> 4 brokers/8 brokers is the only pair that doesn't follow this rule.
>> >> What could be the cause?
>> >>
>> >> I am using 200-byte messages, and the test has each producer publish
>> >> 500k messages to a given topic. Every time I change the number of
>> >> brokers I use a new topic for the test run.
>> >>
>> >> Thanks for any advice.
>> >>
>>
>>


Re: latency test

Posted by Yuheng Du <yu...@gmail.com>.
Thank you Erik! That is helpful!

But I also see jitter in the maximum latencies when running the
experiment.

The average end-to-acknowledgement latency from producer to broker is
around 5ms when using 92 producers and 4 brokers, and the 99.9 percentile
latency is 58ms, but the maximum latency goes up to 1359 ms. How can I
locate the source of this jitter?

Thanks.

On Fri, Sep 4, 2015 at 10:54 AM, Helleren, Erik <Er...@cmegroup.com>
wrote:

> Well… not to be contrarian, but latency depends much more on the latency
> between the producer and the broker that is the leader for the partition
> you are publishing to.  At least when your brokers are not saturated with
> messages, and acks are set to 1.  If acks are set to ALL, latency on a
> non-saturated kafka cluster will be: Round Trip Latency from producer to
> leader for partition + Max(slowest Round Trip Latency to a replica of
> that partition).  If a cluster is saturated with messages, we have to
> assume that all partitions receive an equal distribution of messages to
> avoid linear algebra and queueing theory models.  I don’t like linear
> algebra :P
>
> Since you are probably putting all your latencies into a single histogram
> per producer, or worse, just an average, this pattern would have been
> obscured.  Obligatory lecture about measuring latency by Gil Tene
> (https://www.youtube.com/watch?v=9MKY4KypBzg).  To verify this
> hypothesis, you should re-write the benchmark to plot the latencies for
> each write to a partition for each producer into a histogram. (HDR
> Histogram is pretty good for that).  This would give you
> producers*partitions histograms, which might be unwieldy for that many
> producers. But wait, there is hope!
>
> To verify that this hypothesis holds, you just have to see that there is
> a significant difference between different partitions on a SINGLE
> producing client. So, pick one producing client at random and use the
> data from that. The easy way to do that is just plot all the partition
> latency histograms on top of each other in the same plot, that way you
> have a pretty plot to show people.  If you don’t want to set up plotting,
> you can just compare the medians (50th percentile) of the partitions’
> histograms.  If there is a lot of variance, your latency anomaly is
> explained by brokers 4-7 being slower than nodes 0-3!  If there isn’t a
> lot of variance at 50%, look at higher percentiles.  And if higher
> percentiles for all the partitions look the same, this hypothesis is
> disproved.
>
> If you want to make a general statement about latency of writing to
> kafka, you can merge all the histograms into a single histogram and plot
> that.
>
> To Yuheng’s credit, more brokers always result in more throughput. But
> throughput and latency are two different creatures.  It’s worth noting
> that kafka is designed to be high throughput first and low latency
> second.  And it does a really good job at both.
>
> Disclaimer: I might not like linear algebra, but I do like statistics.
> Let me know if there are topics that need more explanation above that
> aren’t covered by Gil’s lecture.
> -Erik
>
> On 9/4/15, 9:03 AM, "Yuheng Du" <yu...@gmail.com> wrote:
>
> >When I use 32 partitions, the 4-broker latency becomes larger than the
> >8-broker latency.
> >
> >So is it always true that using more brokers gives lower latency when
> >the number of partitions is at least the number of brokers?
> >
> >Thanks.
> >
> >On Thu, Sep 3, 2015 at 10:45 PM, Yuheng Du <yu...@gmail.com>
> >wrote:
> >
> >> I am running a producer latency test. When using 92 producers on 92
> >> physical nodes publishing to 4 brokers, the latency is slightly lower
> >> than when using 8 brokers. I am using 8 partitions for the topic.
> >>
> >> I have rerun the test and it gives me the same result: the 4-broker
> >> scenario still has lower latency than the 8-broker scenario.
> >>
> >> It is weird because I tested 1 broker, 2 brokers, 4 brokers, 8
> >> brokers, 16 brokers and 32 brokers. In all other cases the latency
> >> decreases as the number of brokers increases.
> >>
> >> 4 brokers/8 brokers is the only pair that doesn't follow this rule.
> >> What could be the cause?
> >>
> >> I am using 200-byte messages, and the test has each producer publish
> >> 500k messages to a given topic. Every time I change the number of
> >> brokers I use a new topic for the test run.
> >>
> >> Thanks for any advice.
> >>
>
>

Re: latency test

Posted by "Helleren, Erik" <Er...@cmegroup.com>.
Well… not to be contrarian, but latency depends much more on the latency
between the producer and the broker that is the leader for the partition
you are publishing to.  At least when your brokers are not saturated with
messages, and acks are set to 1.  If acks are set to ALL, latency on a
non-saturated kafka cluster will be: Round Trip Latency from producer to
leader for partition + Max(slowest Round Trip Latency to a replica of
that partition).  If a cluster is saturated with messages, we have to
assume that all partitions receive an equal distribution of messages to
avoid linear algebra and queueing theory models.  I don’t like linear
algebra :P
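
Restated as a formula (a paraphrase of the sentence above, writing RTT for
round-trip latency):

  \mathrm{latency}_{\mathrm{acks=all}} \approx
      \mathrm{RTT}(\mathrm{producer} \to \mathrm{leader})
      + \max_{r \in \mathrm{replicas}} \mathrm{RTT}(\mathrm{leader} \to r)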

Since you are probably putting all your latencies into a single histogram
per producer, or worse, just an average, this pattern would have been
obscured.  Obligatory lecture about measuring latency by Gil Tene
(https://www.youtube.com/watch?v=9MKY4KypBzg).  To verify this hypothesis,
you should re-write the benchmark to plot the latencies for each write to
a partition for each producer into a histogram. (HDR Histogram is pretty
good for that).  This would give you producers*partitions histograms,
which might be unwieldy for that many producers. But wait, there is hope!
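
A hedged sketch of what those per-partition histograms might look like in
code (the helper class and method names are invented for illustration;
HdrHistogram and the producer callback's RecordMetadata are assumed):

  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;
  import org.HdrHistogram.Histogram;
  import org.apache.kafka.clients.producer.Callback;

  // Hypothetical helper: one histogram per partition, filled from the
  // producer callback via metadata.partition().
  public class PerPartitionLatency {
      static final Map<Integer, Histogram> byPartition =
          new ConcurrentHashMap<>();

      static Callback timed(long startNanos) {
          return (metadata, exception) -> {
              if (exception == null) {
                  byPartition
                      .computeIfAbsent(metadata.partition(),
                                       p -> new Histogram(3))
                      .recordValue((System.nanoTime() - startNanos) / 1000);
              }
          };
      }

      // Compare medians (or higher percentiles) across partitions.
      static void report() {
          byPartition.forEach((p, h) -> System.out.printf(
              "partition %d: 50%%=%dus 99.9%%=%dus%n",
              p,
              h.getValueAtPercentile(50.0),
              h.getValueAtPercentile(99.9)));
      }
  }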

To verify that this hypothesis holds, you just have to see that there is a
significant difference between different partitions on a SINGLE producing
client. So, pick one producing client at random and use the data from
that. The easy way to do that is just plot all the partition latency
histograms on top of each other in the same plot, that way you have a
pretty plot to show people.  If you don’t want to set up plotting, you can
just compare the medians (50th percentile) of the partitions’ histograms.
 If there is a lot of variance, your latency anomaly is explained by
brokers 4-7 being slower than nodes 0-3!  If there isn’t a lot of variance
at 50%, look at higher percentiles.  And if higher percentiles for all the
partitions look the same, this hypothesis is disproved.

If you want to make a general statement about latency of writing to kafka,
you can merge all the histograms into a single histogram and plot that.
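
Continuing the sketch above, merging is one call per histogram
(Histogram.add() is HdrHistogram's merge operation):

  // Merge the per-partition histograms into one overall distribution.
  Histogram overall = new Histogram(3);
  PerPartitionLatency.byPartition.values().forEach(overall::add);
  System.out.println("overall 99.9%ile = "
      + overall.getValueAtPercentile(99.9) + "us");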

To Yuheng’s credit, more brokers always result in more throughput. But
throughput and latency are two different creatures.  It’s worth noting that
kafka is designed to be high throughput first and low latency second.  And
it does a really good job at both.

Disclaimer: I might not like linear algebra, but I do like statistics.
Let me know if there are topics that need more explanation above that
aren’t covered by Gil’s lecture.
-Erik

On 9/4/15, 9:03 AM, "Yuheng Du" <yu...@gmail.com> wrote:

>When I use 32 partitions, the 4-broker latency becomes larger than the
>8-broker latency.
>
>So is it always true that using more brokers gives lower latency when
>the number of partitions is at least the number of brokers?
>
>Thanks.
>
>On Thu, Sep 3, 2015 at 10:45 PM, Yuheng Du <yu...@gmail.com>
>wrote:
>
>> I am running a producer latency test. When using 92 producers on 92
>> physical nodes publishing to 4 brokers, the latency is slightly lower
>> than when using 8 brokers. I am using 8 partitions for the topic.
>>
>> I have rerun the test and it gives me the same result: the 4-broker
>> scenario still has lower latency than the 8-broker scenario.
>>
>> It is weird because I tested 1 broker, 2 brokers, 4 brokers, 8 brokers,
>> 16 brokers and 32 brokers. In all other cases the latency decreases as
>> the number of brokers increases.
>>
>> 4 brokers/8 brokers is the only pair that doesn't follow this rule.
>> What could be the cause?
>>
>> I am using 200-byte messages, and the test has each producer publish
>> 500k messages to a given topic. Every time I change the number of
>> brokers I use a new topic for the test run.
>>
>> Thanks for any advice.
>>


Re: latency test

Posted by Yuheng Du <yu...@gmail.com>.
When I use 32 partitions, the 4-broker latency becomes larger than the
8-broker latency.

So is it always true that using more brokers gives lower latency when the
number of partitions is at least the number of brokers?

Thanks.

On Thu, Sep 3, 2015 at 10:45 PM, Yuheng Du <yu...@gmail.com> wrote:

> I am running a producer latency test. When using 92 producers on 92
> physical nodes publishing to 4 brokers, the latency is slightly lower
> than when using 8 brokers. I am using 8 partitions for the topic.
>
> I have rerun the test and it gives me the same result: the 4-broker
> scenario still has lower latency than the 8-broker scenario.
>
> It is weird because I tested 1 broker, 2 brokers, 4 brokers, 8 brokers,
> 16 brokers and 32 brokers. In all other cases the latency decreases as
> the number of brokers increases.
>
> 4 brokers/8 brokers is the only pair that doesn't follow this rule. What
> could be the cause?
>
> I am using 200-byte messages, and the test has each producer publish
> 500k messages to a given topic. Every time I change the number of
> brokers I use a new topic for the test run.
>
> Thanks for any advice.
>