Posted to users@kafka.apache.org by Garry Turkington <g....@improvedigital.com> on 2015/05/13 23:40:12 UTC

Experiences testing new producer performance across multiple threads/producer counts

Hi,

I talked with Gwen at Strata last week and promised to share some of my experiences benchmarking an app reliant on the new producer. I'm using relatively meaty boxes to run my producer code (24 core/64GB RAM), but I couldn't really push them until I got them on the same 10GB fabric as the Kafka cluster they are using (saturating the prior 1GB NICs was just too easy). There are 5 brokers, each 24 core/192GB RAM/8*2TB disks, running 0.8.2.1.

With lots of cores and a dedicated box, the question was then how to deploy my application: in particular, how many worker threads to run and how many instances of the KafkaProducer to share amongst them. I also wanted to see how things would change as I scaled up the thread count.

I ripped out the data retrieval part of my app (it reads from S3) and replaced it with some code that produces random records of average size 500 bytes, varying between 250 and 750. I started the app running, ignored the first 25m messages, then measured the timing for the next 100m and calculated the average messages/sec written to Kafka across that run.
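For anyone wanting to reproduce something similar, a minimal sketch of that kind of load generator is below. This is not the actual code from the tests above; the topic name, broker address, 8-byte key size and ByteArraySerializer choices are all assumptions.

import java.util.Properties;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerLoadGen {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        KafkaProducer<byte[], byte[]> producer = new KafkaProducer<byte[], byte[]>(props);

        final long warmup = 25_000_000L;    // ignored warm-up messages
        final long measured = 100_000_000L; // messages actually timed
        long start = 0;
        for (long i = 0; i < warmup + measured; i++) {
            ThreadLocalRandom rnd = ThreadLocalRandom.current();
            byte[] key = new byte[8];                      // random key per message (assumed size)
            rnd.nextBytes(key);
            byte[] body = new byte[rnd.nextInt(250, 751)]; // 250-750 bytes, average ~500
            rnd.nextBytes(body);
            producer.send(new ProducerRecord<byte[], byte[]>("bench-topic", key, body));
            if (i == warmup - 1) {
                start = System.nanoTime();                 // start the clock after warm-up
            }
        }
        producer.close();                                  // blocks until buffered records are sent
        double secs = (System.nanoTime() - start) / 1e9;
        System.out.printf("~%.0f msgs/sec over %d messages%n", measured / secs, measured);
    }
}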

Starting small I created 4 application threads with a range of approaches to sharing KafkaProducer instances. The records written to the Kafka cluster per second were as follows:

4 threads all sharing 1 client: 332K
4 threads sharing 2 clients: 333K
4 threads, dedicated client per thread: 310K

Note that when I had 2 KafkaProducer clients, as in the second line above, each was used by 2 threads. Similarly below: the number of threads divided by the number of clients gives the maximum number of threads per KafkaProducer instance.
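To make the sharing scheme concrete, here is a minimal sketch (again, not the actual benchmark code) of how N worker threads can share M producer instances by modulo assignment. The thread and client counts, broker address and the empty send loop are placeholders.

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

public class SharedProducerBench {
    public static void main(String[] args) throws InterruptedException {
        final int numThreads = 4;  // application worker threads
        final int numClients = 2;  // KafkaProducer instances shared between them

        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // assumed
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

        final List<KafkaProducer<byte[], byte[]>> producers = new ArrayList<KafkaProducer<byte[], byte[]>>();
        for (int i = 0; i < numClients; i++) {
            producers.add(new KafkaProducer<byte[], byte[]>(props));
        }

        List<Thread> workers = new ArrayList<Thread>();
        for (int t = 0; t < numThreads; t++) {
            // modulo assignment: thread t uses producer t % numClients, so each
            // producer serves at most ceil(numThreads / numClients) threads
            final KafkaProducer<byte[], byte[]> producer = producers.get(t % numClients);
            Thread worker = new Thread(new Runnable() {
                public void run() {
                    // generate records and call producer.send(...) here;
                    // KafkaProducer is thread-safe so sharing it is fine
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread w : workers) {
            w.join();
        }
        for (KafkaProducer<byte[], byte[]> p : producers) {
            p.close();
        }
    }
}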

As can be seen from the above, there's not much in it. Scaling up to 8 application threads, the numbers looked like:

8 threads sharing 1 client: 387K
8 threads sharing 2 clients: 609K
8 threads sharing 4 clients: 628K
8 threads with a dedicated client per thread: 527K

This time sharing a single producer client across all threads has by far the worst performance and isn't much better than when using 4 application threads. The 2 and 4 client options are much better and are in the ballpark of 2x the 4-thread performance. A dedicated client per thread isn't quite as good, but isn't so far off as to be unusable. So then, taking it to 16 application threads:

16 threads sharing 1 client: 380K
16 threads sharing 2 clients: 675K
16 threads sharing 4 clients: 869K
16 threads sharing 8 clients: 733K
16 threads with a dedicated client per thread: 491K

This gives a much clearer performance curve. The 16 thread/4 producer client combination gives by far the best performance, but it is still far from 4x the 4-thread or 2x the 8-thread mark. At this point I seem to be hitting some limiting factor. On the client machine memory was still lightly used and the network was peaking just over 4GB/sec, but CPU load was showing 1-minute load averages around 18-20. CPU load did seem to increase with the number of KafkaProducer instances, but that is more a conclusion from memory and not backed by hard numbers.

For completeness' sake I did do a 24 thread test, but the numbers are as you'd expect. 1 client and 24 clients both showed poor performance; 4, 6 or 8 clients (24 has more ways of being divided evenly!) all showed performance around that of the 16 thread/4 client run above. The other configs were in-between.

With my full application I've found the best deployment so far is to have multiple instances running on the same box. I can get much better performance from 3 instances each with 8 threads than from 1 instance with 24 threads. This is almost certainly because, when adding in my own application logic and the AWS clients, there is just a lot more contention -- not to mention many more I/O waits -- in each application instance. The benchmark variant doesn't have as much happening, but just to compare I ran a few concurrent instances:

2 copies of 8 threads sharing 4 clients: 780K total
2 copies of 8 threads sharing 2 clients: 870K total
3 copies of 8 threads sharing 2 clients: 945K total

So bottom line - around 900K/sec is the max I can get from one of these hosts for my application. At which point I brought a 2nd host to bear and ran 2 concurrent instances of the best performing config on each:

2 copies of 16 threads sharing 4 clients on 2 hosts: 1458K total

This doesn't quite give 2x the single box performance, but it does show that the cluster has capacity to spare beyond what the single client host can drive. This was also backed up by the metrics on the brokers: they got busy, but only moderately so given the amount of work they were doing.

At this point things did get a bit 'ragged edge'. I noticed a very high rate of ISR churn on the brokers; it looked like the replicas were having trouble keeping up with the leader, and hosts were constantly being dropped out of and then re-added to the ISR. I had set the test topic to have a relatively low partition count (1 per spindle), so I doubled that to see if it could help the ISRs remain stable. And my performance fell through the floor. So whereas I thought this was an equation involving application threads and producer instances, perhaps partition count is a third variable. I need to look into that some more, but so far it looks like, for my application - I'm not suggesting this is a universal truth -- sharing a KafkaProducer instance amongst around 4 threads is the sweet spot.

I'll be doing further profiling of my application, so I'll flag to the list anything that appears to be within the producer itself. And because 900K messages/sec was so close to a significant number, I modified the code that generates the messages to keep the key random for each message but to reuse message bodies across multiple messages. At which point 1.05m messages/sec was possible - from a single box. Nice. :)

This turned out much longer than planned; I probably should have blogged it somewhere. If anyone reads this far, I hope it is useful or of interest. I'd be interested in hearing whether the profiles I'm seeing are expected and whether any other tests would be useful.

Regards
Garry

Re: Experiences testing new producer performance across multiple threads/producer counts

Posted by tao xiao <xi...@gmail.com>.
Garry,

Do you mind sharing the source code you used for the profiling?


-- 
Regards,
Tao

RE: Experiences testing new producer performance across multiple threads/producer counts

Posted by Garry Turkington <g....@improvedigital.com>.
Hi Guozhang/Jay/Becket,

Thanks for the responses.

Regarding my point on performance dropping when the number of partitions was increased: that surprised me too, as on another cluster I had done just this to help with an issue of lots of ISR churn and it had been a straight win.

I mentioned in my last mail that I had simplified the code that generates the test messages, with the effect of greatly reducing the CPU load per thread. After doing this the performance on the higher partition-count topic was consistent with the lower partition-count one and showed no degradation. So the sender threads were becoming CPU bound -- I'm assuming possibly due to the additional locks involved with more partitions, but that needs validation.

I've been running my clients with acks=1 and linger.ms floating between 0 and 1, because I want to convince myself of it making a difference, but so far I've not really seen it. Similar to Jay's experiences, I settled on 64K for batch.size because I just didn't see any benefit beyond that, and even the jump from 32K wasn't demonstrably beneficial. For this particular application I've already hit the needed performance (around 700K/sec at peak), but my workload can be quite a sawtooth, moving from peak to idle and back again. So peak becomes the new norm, and understanding the head-room in the current setup and how to grow beyond it is important.

I've had a few more test boxes put on the same 10GB network as the cluster in question, so I'll revisit this, do deeper profiling over the next week, and report back here with findings.

Regards
Garry


Re: Experiences testing new producer performance across multiple threads/producer counts

Posted by Guozhang Wang <wa...@gmail.com>.
Regarding the issue that adding more partitions kills the performance: I would suspect it may be due to insufficient batching. Note that in the new producer batching is done per-partition, and if linger.ms is set low, partition data may not be batched enough before it gets sent to the brokers. Also, since the new producer drains all partitions that belong to the same broker once any one of them hits either the linger time or the batch size, having only one or a few brokers will further exacerbate the insufficient-batching issue. So monitoring the average batch size would be a good idea.
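For anyone wanting to check that, a minimal sketch of reading the average batch size from the producer's own metrics is below. The metric name "batch-size-avg" is what the Java client registers, but exact metric names and the Metric.value() accessor can differ between client versions, so treat the names here as assumptions to verify.

import java.util.Map;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public class BatchSizeMonitor {
    // Print the producer's reported average batch size (bytes per request batch).
    public static void logBatchSizeAvg(KafkaProducer<?, ?> producer) {
        for (Map.Entry<MetricName, ? extends Metric> e : producer.metrics().entrySet()) {
            if ("batch-size-avg".equals(e.getKey().name())) {
                System.out.println("batch-size-avg = " + e.getValue().value());
            }
        }
    }
}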

Guozhang


-- 
-- Guozhang

Re: Experiences testing new producer performance across multiple threads/producer counts

Posted by Jay Kreps <ja...@gmail.com>.
Hey Garry,

Super interesting. We honestly never did a ton of performance tuning on the
producer. I checked the profiles early on in development and we fixed a few
issues that popped up in deployment, but I don't think anyone has done a
really scientific look. If you (or anyone else) want to dive into things I
suspect it could be improved.

Becket is exactly right. There are two possible bottlenecks you can hit in
the producer--the single background sender thread and the per-partition
lock. You can check utilization on the background thread with jvisualvm
(it's named something like kafka-producer-network-thread). The locking is
fairly hard to improve.

It's a little surprising that adding partitions caused a large decrease in
performance. Generally this is only the case if you override the flush
settings on the broker to force fsync more frequently.

The ISR issues under heavy load are probably fixable, the issue is
discussed a bit here:
http://blog.confluent.io/2015/04/07/hands-free-kafka-replication-a-lesson-in-operational-simplicity/

The producer settings that may matter for performance are:
acks
batch.size (though beyond 32k I didn't see much improvement)
linger.ms (setting >= 1 may help a bit)
send.buffer.bytes (maybe, but probably not)
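As a concrete illustration (not something from the thread itself), a minimal producer config using these settings and the values discussed above might look like the sketch below; the broker address and serializers are placeholders, and the values are starting points rather than recommendations.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

public class TunedProducer {
    public static KafkaProducer<byte[], byte[]> create() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");  // placeholder
        props.put("acks", "1");                          // as used in the tests in this thread
        props.put("batch.size", "65536");                // 64K; little gain was seen beyond 32K
        props.put("linger.ms", "1");                     // >= 1 may help batching a bit
        // props.put("send.buffer.bytes", "131072");     // maybe, but probably not worth tuning
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        return new KafkaProducer<byte[], byte[]>(props);
    }
}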

Cheers,

-Jay


Re: Experiences testing new producer performance across multiple threads/producer counts

Posted by Jiangjie Qin <jq...@linkedin.com.INVALID>.
Thanks for sharing this, Garry. I actually did similar tests before but unfortunately lost the test data because my laptop rebooted and I forgot to save the data...

Anyway, several things to verify:

1. Remember that KafkaProducer holds a lock per partition, so if you have only one partition in the target topic and many application threads, lock contention could be an issue.

2. It matters how frequently the sender thread wakes up and runs. You can take a look at the following sensors to further verify whether the sender thread has really become a bottleneck or not:
select-rate
io-wait-time-ns-avg
io-time-ns-avg

3. Batch size matters, so take a look at the sensor batch-size-avg and see if the average batch size makes sense or not.
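A hedged sketch of pulling those sensors out of the producer at runtime is below. The derived "busy fraction" is a rough approximation added here (time not spent blocked waiting in select), not an official metric, and the sensor names should be checked against the client version in use.

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public class SenderThreadCheck {
    public static void report(KafkaProducer<?, ?> producer) {
        Map<String, Double> byName = new HashMap<String, Double>();
        for (Map.Entry<MetricName, ? extends Metric> e : producer.metrics().entrySet()) {
            byName.put(e.getKey().name(), e.getValue().value());
        }
        double selectRate = get(byName, "select-rate");          // selects per second
        double ioWaitNsAvg = get(byName, "io-wait-time-ns-avg"); // ns spent waiting per select
        double ioNsAvg = get(byName, "io-time-ns-avg");          // ns spent doing I/O per select
        // Rough estimate of how busy the sender thread is: 1 minus the fraction of
        // each second it spends blocked waiting in select.
        double waitFraction = Math.min(1.0, selectRate * ioWaitNsAvg / 1e9);
        System.out.printf("select-rate=%.0f io-wait-time-ns-avg=%.0f io-time-ns-avg=%.0f "
                + "approx sender busy fraction=%.2f%n",
                selectRate, ioWaitNsAvg, ioNsAvg, 1.0 - waitFraction);
    }

    private static double get(Map<String, Double> m, String name) {
        Double v = m.get(name);
        return v == null ? 0.0 : v;
    }
}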

Looking forward to your further profiling. My thinking is that unless you are sending very small messages to a small number of partitions, you don't need to worry about using more than one producer.

Thanks.

Jiangjie (Becket) Qin


