Posted to dev@kafka.apache.org by Gwen Shapira <gw...@confluent.io> on 2016/05/10 01:49:44 UTC

[VOTE] 0.10.0.0 RC4

Hello Kafka users, developers and client-developers,

This is a new candidate for release of Apache Kafka 0.10.0.0. This
is a major release that includes: (1) a new message format including
timestamps, (2) a client interceptor API, and (3) Kafka Streams. Since this
is a major release, we will give people more time to try it out and give
feedback.

We fixed a bunch of bugs since the previous release candidate, and we are
done with the blockers.
So let's test and vote!

Release notes for the 0.10.0.0 release:
http://home.apache.org/~gwenshap/0.10.0.0-rc4/RELEASE_NOTES.html

*** Please download, test and vote by Monday, May 16, 9am PT

Kafka's KEYS file containing PGP keys we use to sign the release:
http://kafka.apache.org/KEYS

* Release artifacts to be voted upon (source and binary):
http://home.apache.org/~gwenshap/0.10.0.0-rc4/
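
For anyone verifying the artifacts before voting, a minimal check looks
something like this (the tarball name is an assumption; adjust it for your
Scala version, and compare checksums against the .md5 file in the same
directory):

    # Import the Kafka signing keys, then verify a downloaded artifact.
    wget http://kafka.apache.org/KEYS
    gpg --import KEYS
    wget http://home.apache.org/~gwenshap/0.10.0.0-rc4/kafka_2.11-0.10.0.0.tgz
    wget http://home.apache.org/~gwenshap/0.10.0.0-rc4/kafka_2.11-0.10.0.0.tgz.asc
    gpg --verify kafka_2.11-0.10.0.0.tgz.asc kafka_2.11-0.10.0.0.tgz
    md5sum kafka_2.11-0.10.0.0.tgz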

* Maven artifacts to be voted upon:
https://repository.apache.org/content/groups/staging/

* scala-doc
http://home.apache.org/~gwenshap/0.10.0.0-rc4/scaladoc

* java-doc
http://home.apache.org/~gwenshap/0.10.0.0-rc4/javadoc/

* tag to be voted upon (off 0.10.0 branch) is the 0.10.0.0-rc4 tag:
https://git-wip-us.apache.org/repos/asf?p=kafka.git;a=tag;h=31cf73c737d79b09b62acce461c7c22461ff5ad8

* Documentation:
http://kafka.apache.org/0100/documentation.html

* Protocol:
http://kafka.apache.org/0100/protocol.html

/**************************************

Thanks,

Gwen

Re: [VOTE] 0.10.0.0 RC4

Posted by Ismael Juma <is...@juma.me.uk>.
Hi Tom,

This is puzzling because, as you said, not much has changed in the TLS code
since 0.9.0.1. A JIRA sounds good. I was going to ask if you could test the
commits before/after KAFKA-3025, but I see that Gwen has already suggested that.
:)

Ismael

On Thu, May 12, 2016 at 9:26 PM, Tom Crayford <tc...@heroku.com> wrote:

> We've started running our usual suite of performance tests against Kafka
> 0.10.0.0 RC. These tests orchestrate multiple consumer/producer machines to
> run a fairly normal mixed workload of producers and consumers (each
> producer/consumer are just instances of kafka's inbuilt consumer/producer
> perf tests). We've found about a 33% performance drop in the producer if
> TLS is used (compared to 0.9.0.1).
>
> We've seen notable producer performance degradations between 0.9.0.1 and
> 0.10.0.0 RC. We're running as of commit 9404680 right now.
>
> Our specific test case runs Kafka on 8 EC2 machines, with enhanced
> networking. Nothing is changed between the instances, and I've reproduced
> this over 4 different sets of clusters now. We're seeing about a 33%
> performance drop between 0.9.0.1 and 0.10.0.0 as of commit 9404680. Please
> note that this doesn't match up with
> https://issues.apache.org/jira/browse/KAFKA-3565, because our performance
> tests are with compression off, and this seems to be a TLS-only issue.
>
> Under 0.10.0-rc4, we see an 8 node cluster with replication factor of 3,
> and 13 producers max out at around 1 million 100 byte messages a second.
> Under 0.9.0.1, the same cluster does 1.5 million messages a second. Both
> tests were with TLS on. I've reproduced this on multiple clusters now (5 or
> so of each version) to account for the inherent performance variance of
> EC2. There's no notable performance difference without TLS on these runs -
> it appears to be a TLS regression entirely.
>
> A single producer with TLS under 0.10 does about 75k messages/s. Under
> 0.9.0.1 it does around 120k messages/s.
>
> The exact producer-perf line we're using is this:
>
> bin/kafka-producer-perf-test --topic "bench" --num-records "500000000"
> --record-size "100" --throughput "100" --producer-props acks="-1"
> bootstrap.servers=REDACTED ssl.keystore.location=client.jks
> ssl.keystore.password=REDACTED ssl.truststore.location=server.jks
> ssl.truststore.password=REDACTED
> ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1 security.protocol=SSL
>
> We're using the same setup, machine type etc for each test run.
>
> We've tried using both 0.9.0.1 producers and 0.10.0.0 producers and the TLS
> performance impact was there for both.
>
> I've glanced over the code between 0.9.0.1 and 0.10.0.0 and haven't seen
> anything that seemed to have this kind of impact - indeed the TLS code
> doesn't seem to have changed much between 0.9.0.1 and 0.10.0.0.
>
> Any thoughts? Should I file an issue and see about reproducing a more
> minimal test case?
>
> I don't think this is related to
> https://issues.apache.org/jira/browse/KAFKA-3565 - that is for compression
> on and plaintext, and this is for TLS only.
>

Re: [VOTE] 0.10.0.0 RC4

Posted by Gwen Shapira <gw...@confluent.io>.
I know it is a big ask, but can you try bisecting?

For example, test before/after on commits:
* 45c8195 KAFKA-3025; Added timetamp to Message and use relative offset.
* 5b375d7 KAFKA-3149; Extend SASL implementation to support more mechanisms
* 69d9a66 KAFKA-3618; Handle ApiVersionsRequest before SASL authentication

This may help us nail down the source of the issue.
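
As a rough sketch, that bisect could look like the following; the wrapper
script is hypothetical (it should build Kafka, run the TLS perf test, and
exit non-zero when throughput regresses):

    # Bisect between the last known-good release tag and the tested RC commit.
    git clone https://git-wip-us.apache.org/repos/asf/kafka.git && cd kafka
    git bisect start
    git bisect bad 9404680          # the commit Tom's tests flagged
    git bisect good 0.9.0.1         # known-good release tag
    git bisect run ./tls-perf-test.sh   # hypothetical pass/fail wrapper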

Gwen

On Thu, May 12, 2016 at 1:38 PM, Tom Crayford <tc...@heroku.com> wrote:
> Yep, confirm.
>
> On Thu, May 12, 2016 at 9:37 PM, Gwen Shapira <gw...@confluent.io> wrote:
>
>> Just to confirm:
>> You tested both versions with plain text and saw no performance drop?
>>
>>
>> On Thu, May 12, 2016 at 1:26 PM, Tom Crayford <tc...@heroku.com>
>> wrote:
>> > We've started running our usual suite of performance tests against Kafka
>> > 0.10.0.0 RC. These tests orchestrate multiple consumer/producer machines
>> to
>> > run a fairly normal mixed workload of producers and consumers (each
>> > producer/consumer are just instances of kafka's inbuilt consumer/producer
>> > perf tests). We've found about a 33% performance drop in the producer if
>> > TLS is used (compared to 0.9.0.1).
>> >
>> > We've seen notable producer performance degradations between 0.9.0.1 and
>> > 0.10.0.0 RC. We're running as of commit 9404680 right now.
>> >
>> > Our specific test case runs Kafka on 8 EC2 machines, with enhanced
>> > networking. Nothing is changed between the instances, and I've reproduced
>> > this over 4 different sets of clusters now. We're seeing about a 33%
>> > performance drop between 0.9.0.1 and 0.10.0.0 as of commit 9404680.
>> Please
>> > note that this doesn't match up with
>> > https://issues.apache.org/jira/browse/KAFKA-3565, because our
>> performance
>> > tests are with compression off, and this seems to be a TLS-only issue.
>> >
>> > Under 0.10.0-rc4, we see an 8 node cluster with replication factor of 3,
>> > and 13 producers max out at around 1 million 100 byte messages a second.
>> > Under 0.9.0.1, the same cluster does 1.5 million messages a second. Both
>> > tests were with TLS on. I've reproduced this on multiple clusters now (5
>> or
>> > so of each version) to account for the inherent performance variance of
>> > EC2. There's no notable performance difference without TLS on these runs
>> -
>> > it appears to be a TLS regression entirely.
>> >
>> > A single producer with TLS under 0.10 does about 75k messages/s. Under
>> > 0.9.0.1 it does around 120k messages/s.
>> >
>> > The exact producer-perf line we're using is this:
>> >
>> > bin/kafka-producer-perf-test --topic "bench" --num-records "500000000"
>> > --record-size "100" --throughput "100" --producer-props acks="-1"
>> > bootstrap.servers=REDACTED ssl.keystore.location=client.jks
>> > ssl.keystore.password=REDACTED ssl.truststore.location=server.jks
>> > ssl.truststore.password=REDACTED
>> > ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1 security.protocol=SSL
>> >
>> > We're using the same setup, machine type etc for each test run.
>> >
>> > We've tried using both 0.9.0.1 producers and 0.10.0.0 producers and the
>> TLS
>> > performance impact was there for both.
>> >
>> > I've glanced over the code between 0.9.0.1 and 0.10.0.0 and haven't seen
>> > anything that seemed to have this kind of impact - indeed the TLS code
>> > doesn't seem to have changed much between 0.9.0.1 and 0.10.0.0.
>> >
>> > Any thoughts? Should I file an issue and see about reproducing a more
>> > minimal test case?
>> >
>> > I don't think this is related to
>> > https://issues.apache.org/jira/browse/KAFKA-3565 - that is for
>> compression
>> > on and plaintext, and this is for TLS only.
>>

Re: [VOTE] 0.10.0.0 RC4

Posted by Tom Crayford <tc...@heroku.com>.
Yep, confirm.

On Thu, May 12, 2016 at 9:37 PM, Gwen Shapira <gw...@confluent.io> wrote:

> Just to confirm:
> You tested both versions with plain text and saw no performance drop?
>
>
> On Thu, May 12, 2016 at 1:26 PM, Tom Crayford <tc...@heroku.com>
> wrote:
> > We've started running our usual suite of performance tests against Kafka
> > 0.10.0.0 RC. These tests orchestrate multiple consumer/producer machines
> to
> > run a fairly normal mixed workload of producers and consumers (each
> > producer/consumer are just instances of kafka's inbuilt consumer/producer
> > perf tests). We've found about a 33% performance drop in the producer if
> > TLS is used (compared to 0.9.0.1).
> >
> > We've seen notable producer performance degradations between 0.9.0.1 and
> > 0.10.0.0 RC. We're running as of commit 9404680 right now.
> >
> > Our specific test case runs Kafka on 8 EC2 machines, with enhanced
> > networking. Nothing is changed between the instances, and I've reproduced
> > this over 4 different sets of clusters now. We're seeing about a 33%
> > performance drop between 0.9.0.1 and 0.10.0.0 as of commit 9404680.
> Please
> > note that this doesn't match up with
> > https://issues.apache.org/jira/browse/KAFKA-3565, because our
> performance
> > tests are with compression off, and this seems to be a TLS-only issue.
> >
> > Under 0.10.0-rc4, we see an 8 node cluster with replication factor of 3,
> > and 13 producers max out at around 1 million 100 byte messages a second.
> > Under 0.9.0.1, the same cluster does 1.5 million messages a second. Both
> > tests were with TLS on. I've reproduced this on multiple clusters now (5
> or
> > so of each version) to account for the inherent performance variance of
> > EC2. There's no notable performance difference without TLS on these runs
> -
> > it appears to be a TLS regression entirely.
> >
> > A single producer with TLS under 0.10 does about 75k messages/s. Under
> > 0.9.0.1 it does around 120k messages/s.
> >
> > The exact producer-perf line we're using is this:
> >
> > bin/kafka-producer-perf-test --topic "bench" --num-records "500000000"
> > --record-size "100" --throughput "100" --producer-props acks="-1"
> > bootstrap.servers=REDACTED ssl.keystore.location=client.jks
> > ssl.keystore.password=REDACTED ssl.truststore.location=server.jks
> > ssl.truststore.password=REDACTED
> > ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1 security.protocol=SSL
> >
> > We're using the same setup, machine type etc for each test run.
> >
> > We've tried using both 0.9.0.1 producers and 0.10.0.0 producers and the
> TLS
> > performance impact was there for both.
> >
> > I've glanced over the code between 0.9.0.1 and 0.10.0.0 and haven't seen
> > anything that seemed to have this kind of impact - indeed the TLS code
> > doesn't seem to have changed much between 0.9.0.1 and 0.10.0.0.
> >
> > Any thoughts? Should I file an issue and see about reproducing a more
> > minimal test case?
> >
> > I don't think this is related to
> > https://issues.apache.org/jira/browse/KAFKA-3565 - that is for
> compression
> > on and plaintext, and this is for TLS only.
>

Re: [VOTE] 0.10.0.0 RC4

Posted by Gwen Shapira <gw...@confluent.io>.
Just to confirm:
You tested both versions with plain text and saw no performance drop?


On Thu, May 12, 2016 at 1:26 PM, Tom Crayford <tc...@heroku.com> wrote:
> We've started running our usual suite of performance tests against Kafka
> 0.10.0.0 RC. These tests orchestrate multiple consumer/producer machines to
> run a fairly normal mixed workload of producers and consumers (each
> producer/consumer are just instances of kafka's inbuilt consumer/producer
> perf tests). We've found about a 33% performance drop in the producer if
> TLS is used (compared to 0.9.0.1).
>
> We've seen notable producer performance degradations between 0.9.0.1 and
> 0.10.0.0 RC. We're running as of commit 9404680 right now.
>
> Our specific test case runs Kafka on 8 EC2 machines, with enhanced
> networking. Nothing is changed between the instances, and I've reproduced
> this over 4 different sets of clusters now. We're seeing about a 33%
> performance drop between 0.9.0.1 and 0.10.0.0 as of commit 9404680. Please
> note that this doesn't match up with
> https://issues.apache.org/jira/browse/KAFKA-3565, because our performance
> tests are with compression off, and this seems to be a TLS-only issue.
>
> Under 0.10.0-rc4, we see an 8 node cluster with replication factor of 3,
> and 13 producers max out at around 1 million 100 byte messages a second.
> Under 0.9.0.1, the same cluster does 1.5 million messages a second. Both
> tests were with TLS on. I've reproduced this on multiple clusters now (5 or
> so of each version) to account for the inherent performance variance of
> EC2. There's no notable performance difference without TLS on these runs -
> it appears to be a TLS regression entirely.
>
> A single producer with TLS under 0.10 does about 75k messages/s. Under
> 0.9.0.1 it does around 120k messages/s.
>
> The exact producer-perf line we're using is this:
>
> bin/kafka-producer-perf-test --topic "bench" --num-records "500000000"
> --record-size "100" --throughput "100" --producer-props acks="-1"
> bootstrap.servers=REDACTED ssl.keystore.location=client.jks
> ssl.keystore.password=REDACTED ssl.truststore.location=server.jks
> ssl.truststore.password=REDACTED
> ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1 security.protocol=SSL
>
> We're using the same setup, machine type etc for each test run.
>
> We've tried using both 0.9.0.1 producers and 0.10.0.0 producers and the TLS
> performance impact was there for both.
>
> I've glanced over the code between 0.9.0.1 and 0.10.0.0 and haven't seen
> anything that seemed to have this kind of impact - indeed the TLS code
> doesn't seem to have changed much between 0.9.0.1 and 0.10.0.0.
>
> Any thoughts? Should I file an issue and see about reproducing a more
> minimal test case?
>
> I don't think this is related to
> https://issues.apache.org/jira/browse/KAFKA-3565 - that is for compression
> on and plaintext, and this is for TLS only.

Re: [VOTE] 0.10.0.0 RC4

Posted by Gwen Shapira <gw...@confluent.io>.
Thanks, man!

Good to see Heroku being a good friend to the Kafka community by
testing new releases, reporting issues and following up with the cause
and a documentation PR.

With this out of the way, we closed all known blockers for 0.10.0.0.
I'll roll out a new RC tomorrow morning.

Gwen

On Sun, May 15, 2016 at 1:27 PM, Tom Crayford <tc...@heroku.com> wrote:
> https://github.com/apache/kafka/pull/1389
>
> On Sun, May 15, 2016 at 9:22 PM, Ismael Juma <is...@juma.me.uk> wrote:
>
>> Hi Tom,
>>
>> Great to hear that the failure testing scenario went well. :)
>>
>> Your suggested improvement sounds good to me and a PR would be great. For
>> this kind of change, you can skip the JIRA, just prefix the PR title with
>> `MINOR:`.
>>
>> Thanks,
>> Ismael
>>
>> On Sun, May 15, 2016 at 9:17 PM, Tom Crayford <tc...@heroku.com>
>> wrote:
>>
>> > How about this?
>> >
>> >     <b>Note:</b> Due to the additional timestamp introduced in each
>> message
>> > (8 bytes of data), producers sending small messages may see a
>> >     message throughput degradation because of the increased overhead.
>> > Likewise, replication now transmits an additional 8 bytes per message.
>> >     If you're running close to the network capacity of your cluster, it's
>> > possible that you'll overwhelm the network cards and see failures and
>> > performance
>> >     issues due to the overload.
>> >     When receiving compressed messages, 0.10.0
>> >     brokers avoid recompressing the messages, which in general reduces
>> the
>> > latency and improves the throughput. In
>> >     certain cases, this may reduce the batching size on the producer,
>> which
>> > could lead to worse throughput. If this
>> >     happens, users can tune linger.ms and batch.size of the producer for
>> > better throughput.
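
(For illustration, that tuning can be exercised with the same perf tool used
elsewhere in this thread; the linger.ms and batch.size values below are
assumptions, not recommendations:

    bin/kafka-producer-perf-test --topic bench --num-records 50000000 \
      --record-size 100 --throughput 100000 \
      --producer-props bootstrap.servers=REDACTED acks=-1 \
      linger.ms=10 batch.size=65536

A larger batch.size plus a non-zero linger.ms gives the producer room and
time to refill batches when the broker no longer recompresses messages.)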
>> >
>> > Would you like a Jira/PR with this kind of change so we can discuss it
>> in
>> > a more convenient format?
>> >
>> > Re our failure testing scenario: Kafka 0.10 RC behaves exactly the same
>> > under failure as 0.9 - the controller typically shifts the leader in
>> around
>> > 2 seconds or so, and the benchmark sees a small drop in throughput during
>> > that, then another drop whilst the replacement broker comes back up to
>> speed.
>> > So, overall we're extremely happy and excited for this release! Thanks to
>> > the committers and maintainers for all their hard work.
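
(The failure test here is, roughly, a hard kill of one broker mid-benchmark
while watching leadership move. A sketch, with the pid file, topic and
partition all assumed:

    kill -9 $(cat /var/run/kafka.pid)   # hypothetical pid file location
    # The "leader" field in the partition state should change within seconds:
    bin/zookeeper-shell.sh localhost:2181 get /brokers/topics/bench/partitions/0/state

The benchmark's throughput dip should line up with that leader move.)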
>> >
>> > On Sun, May 15, 2016 at 9:03 PM, Ismael Juma <is...@juma.me.uk> wrote:
>> >
>> > > Hi Tom,
>> > >
>> > > Thanks for the update and for all the testing you have done! No worries
>> > > about the chase here, I'd much rather have false positives by people
>> who
>> > > are validating the releases than false negatives because people don't
>> > > validate the releases. :)
>> > >
>> > > The upgrade note we currently have follows:
>> > >
>> > > https://github.com/apache/kafka/blob/0.10.0/docs/upgrade.html#L67
>> > >
>> > > Please feel free to suggest improvements.
>> > >
>> > > Thanks,
>> > > Ismael
>> > >
>> > > On Sun, May 15, 2016 at 6:39 PM, Tom Crayford <tc...@heroku.com>
>> > > wrote:
>> > >
>> > > > I've been digging into this some more. It seems like this may have
>> been
>> > > an
>> > > > issue with benchmarks maxing out the network card - under 0.10.0.0-RC
>> > the
>> > > slight additional bandwidth per message seems to have pushed the
>> > > broker's
>> > > > NIC into overload territory where it starts dropping packets
>> (verified
>> > > with
>> > > > ifconfig on each broker). This leads to it not being able to talk to
>> > > > Zookeeper properly, which leads to OfflinePartitions, which then
>> causes
>> > > > issues with the benchmark's validity, as throughput drops a lot when
>> > > brokers
>> > > > are flapping in and out of being online. 0.9.0.1, sending 8 bytes
>> > less
>> > > per message, means the broker's NIC can sustain more messages/s. There
>> > was
>> > > > an "alignment" issue with the benchmarks here - under 0.9 we were
>> > *just*
>> > > at
>> > > > the barrier of the broker's NICs sustaining traffic, and under 0.10
>> we
>> > > > pushed over that (at 1.5 million messages/s, 8 bytes extra per
>> message
>> > is
>> > > > an extra 36 MB/s with replication factor 3 [if my math is right, and
>> > > that's
>> > > > before SSL encryption which may be additional overhead], which is as
>> > much
>> > > > as an additional producer machine).
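
For reference, the arithmetic and the NIC check are easy to reproduce (the
interface name is an assumption):

    # Extra bytes/s from the 8-byte timestamp at 1.5M msgs/s, replication factor 3:
    echo $((1500000 * 8 * 3))        # 36000000 bytes/s, i.e. ~36 MB/s
    # Look for non-zero "dropped" counters on each broker:
    ifconfig eth0 | grep -i dropped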
>> > > >
>> > > > The dropped packets and the flapping weren't causing notable timeout
>> > > issues
>> > > > in the producer, but looking at the metrics on the brokers, offline
>> > > > partitions were clearly being triggered, and the broker logs
>> > show
>> > > > ZK session timeouts. This is consistent with earlier benchmarking
>> > > > experience - the number of producers we were running under 0.9.0.1
>> was
>> > > > carefully selected to be just under the limit here.
>> > > >
>> > > > The other issue with the benchmark, where I reported a difference
>> > > > between two single producers, was caused by a producer-machine
>> > > > performance issue that I wasn't properly aware of. Apologies there.
>> > > >
>> > > > I've done benchmarks now where I limit the producer throughput (via
>> > > > --throughput) to slightly below what the NICs can sustain and seen no
>> > > > notable performance or stability difference between 0.10 and 0.9.0.1
>> as
>> > > > long as you stay under the limits of the network interfaces. All of
>> the
>> > > > clusters I have tested happily keep up a benchmark at this rate for 6
>> > > hours
>> > > > under both 0.9.0.1 and 0.10.0.0. I've also verified that our clusters
>> > are
>> > > > entirely network bound in these producer benchmarking scenarios - the
>> > > disks
>> > > > and CPU/memory have a bunch of remaining capacity.
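
A sketch of such a capped run, with the cap value purely illustrative:

    # Keep each producer below the NIC ceiling; 90000 msgs/s is an assumed
    # per-producer budget, not a measured limit.
    bin/kafka-producer-perf-test --topic bench --num-records 500000000 \
      --record-size 100 --throughput 90000 \
      --producer-props bootstrap.servers=REDACTED acks=-1 security.protocol=SSL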
>> > > >
>> > > > This was pretty hard to verify fully, which is why I've taken so long
>> > to
>> > > > reply. All in all I think the result here is expected and not a
>> blocker
>> > > for
>> > > > release, but a good thing to note on upgrades - if folk are running
>> at
>> > > the
>> > > > limit of their network cards (which you never want to do anyway, but
>> > > > benchmarking scenarios often uncover those limits), they'll see
>> issues
>> > > due
>> > > > to increased replication and producer traffic under 0.10.0.0.
>> > > >
>> > > > Apologies for the chase here - this distinctly seemed like a real
>> issue
>> > > and
>> > > > one I (and I think everybody else) would have wanted to block the
>> > release
>> > > > on. I'm going to move onto our "failure" testing, in which we run the
>> > > same
>> > > > performance benchmarks whilst causing a hard kill on the node. We've
>> > seen
>> > > > very good results for that under 0.9 and hopefully they'll continue
>> > under
>> > > > 0.10.
>> > > >
>> > > > On Sat, May 14, 2016 at 1:33 AM, Gwen Shapira <gw...@confluent.io>
>> > wrote:
>> > > >
>> > > > > also, perhaps sharing the broker configuration? maybe this will
>> > > > > provide some hints...
>> > > > >
>> > > > > On Fri, May 13, 2016 at 5:31 PM, Ismael Juma <is...@juma.me.uk>
>> > > wrote:
>> > > > > > Thanks Tom. I just wanted to share that I have been unable to
>> > > reproduce
>> > > > > > this so far. Please feel free to share whatever information
>> you
>> > > > have
>> > > > > so
>> > > > > > far when you have a chance, don't feel that you need to have all
>> > the
>> > > > > > answers.
>> > > > > >
>> > > > > > Ismael
>> > > > > >
>> > > > > > On Fri, May 13, 2016 at 7:32 PM, Tom Crayford <
>> > tcrayford@heroku.com>
>> > > > > wrote:
>> > > > > >
>> > > > > >> I've been investigating this pretty hard since I first noticed
>> it.
>> > > > Right
>> > > > > >> now I have more avenues for investigation than I can shake a
>> stick
>> > > at,
>> > > > > and
>> > > > > >> am also dealing with several other things in flight/on fire.
>> I'll
>> > > > > respond
>> > > > > >> when I have more information and can confirm things.
>> > > > > >>
>> > > > > >> On Fri, May 13, 2016 at 6:30 PM, Becket Qin <
>> becket.qin@gmail.com
>> > >
>> > > > > wrote:
>> > > > > >>
>> > > > > >> > Tom,
>> > > > > >> >
>> > > > > >> > Maybe it was mentioned and I missed it. I am wondering if you see
>> > > > > performance
>> > > > > >> > degradation on the consumer side when TLS is used? This could
>> > help
>> > > > us
>> > > > > >> > understand whether the issue is only producer related or TLS
>> in
>> > > > > general.
>> > > > > >> >
>> > > > > >> > Thanks,
>> > > > > >> >
>> > > > > >> > Jiangjie (Becket) Qin
>> > > > > >> >
>> > > > > >> > On Fri, May 13, 2016 at 6:19 AM, Tom Crayford <
>> > > tcrayford@heroku.com
>> > > > >
>> > > > > >> > wrote:
>> > > > > >> >
>> > > > > >> > > Ismael,
>> > > > > >> > >
>> > > > > >> > > Thanks. I'm writing up an issue with some new findings since
>> > > > > yesterday
>> > > > > >> > > right now.
>> > > > > >> > >
>> > > > > >> > > Thanks
>> > > > > >> > >
>> > > > > >> > > Tom
>> > > > > >> > >
>> > > > > >> > > On Fri, May 13, 2016 at 1:06 PM, Ismael Juma <
>> > ismael@juma.me.uk
>> > > >
>> > > > > >> wrote:
>> > > > > >> > >
>> > > > > >> > > > Hi Tom,
>> > > > > >> > > >
>> > > > > >> > > > That's because JIRA is in lockdown due to excessive spam.
>> I
>> > > have
>> > > > > >> added
>> > > > > >> > > you
>> > > > > >> > > > as a contributor in JIRA and you should be able to file a
>> > > ticket
>> > > > > now.
>> > > > > >> > > >
>> > > > > >> > > > Thanks,
>> > > > > >> > > > Ismael
>> > > > > >> > > >
>> > > > > >> > > > On Fri, May 13, 2016 at 12:17 PM, Tom Crayford <
>> > > > > tcrayford@heroku.com
>> > > > > >> >
>> > > > > >> > > > wrote:
>> > > > > >> > > >
>> > > > > >> > > > > Ok, I don't seem to be able to file a new Jira issue at
>> > all.
>> > > > Can
>> > > > > >> > > somebody
>> > > > > >> > > > > check my permissions on Jira? My user is
>> > `tcrayford-heroku`
>> > > > > >> > > > >
>> > > > > >> > > > > Tom Crayford
>> > > > > >> > > > > Heroku Kafka
>> > > > > >> > > > >
>> > > > > >> > > > > On Fri, May 13, 2016 at 12:24 AM, Jun Rao <
>> > jun@confluent.io
>> > > >
>> > > > > >> wrote:
>> > > > > >> > > > >
>> > > > > >> > > > > > Tom,
>> > > > > >> > > > > >
>> > > > > >> > > > > > We don't have a CSV metrics reporter in the producer
>> > right
>> > > > > now.
>> > > > > >> The
>> > > > > >> > > > > metrics
>> > > > > >> > > > > > will be available in jmx. You can find out the details
>> > in
>> > > > > >> > > > > >
>> > > > > >>
>> > http://kafka.apache.org/documentation.html#new_producer_monitoring
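
(For example, assuming the producer JVM exposes remote JMX on port 9999 and
the jmxterm CLI is available, the average batch size could be read like this,
with the client-id an assumption:

    echo "get -b kafka.producer:type=producer-metrics,client-id=producer-1 batch-size-avg" \
      | java -jar jmxterm-1.0-alpha-4-uber.jar -l localhost:9999 -n

The same bean is visible in jconsole under the kafka.producer domain.)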
>> > > > > >> > > > > >
>> > > > > >> > > > > > Thanks,
>> > > > > >> > > > > >
>> > > > > >> > > > > > Jun
>> > > > > >> > > > > >
>> > > > > >> > > > > > On Thu, May 12, 2016 at 3:08 PM, Tom Crayford <
>> > > > > >> > tcrayford@heroku.com>
>> > > > > >> > > > > > wrote:
>> > > > > >> > > > > >
>> > > > > >> > > > > > > Yep, I can try those particular commits tomorrow.
>> > > Before I
>> > > > > try
>> > > > > >> a
>> > > > > >> > > > > bisect,
>> > > > > >> > > > > > > I'm going to replicate with a less intensive to
>> > iterate
>> > > on
>> > > > > >> > smaller
>> > > > > >> > > > > scale
>> > > > > >> > > > > > > perf test.
>> > > > > >> > > > > > >
>> > > > > >> > > > > > > Jun, inline:
>> > > > > >> > > > > > >
>> > > > > >> > > > > > > On Thursday, 12 May 2016, Jun Rao <jun@confluent.io
>> >
>> > > > wrote:
>> > > > > >> > > > > > >
>> > > > > >> > > > > > > > Tom,
>> > > > > >> > > > > > > >
>> > > > > >> > > > > > > > Thanks for reporting this. A few quick comments.
>> > > > > >> > > > > > > >
>> > > > > >> > > > > > > > 1. Did you send the right command for
>> producer-perf?
>> > > The
>> > > > > >> > command
>> > > > > >> > > > > limits
>> > > > > >> > > > > > > the
>> > > > > >> > > > > > > > throughput to 100 msgs/sec. So, not sure how a
>> > single
>> > > > > >> producer
>> > > > > >> > > can
>> > > > > >> > > > > get
>> > > > > >> > > > > > > 75K
>> > > > > >> > > > > > > > msgs/sec.
>> > > > > >> > > > > > >
>> > > > > >> > > > > > >
>> > > > > >> > > > > > > Ah yep, wrong commands. I'll get the right one
>> > tomorrow.
>> > > > > Sorry,
>> > > > > >> > was
>> > > > > >> > > > > > > interpolating variables into a shell script.
>> > > > > >> > > > > > >
>> > > > > >> > > > > > >
>> > > > > >> > > > > > > >
>> > > > > >> > > > > > > > 2. Could you collect some stats (e.g. average
>> batch
>> > > > size)
>> > > > > in
>> > > > > >> > the
>> > > > > >> > > > > > producer
>> > > > > >> > > > > > > > and see if there is any noticeable difference
>> > between
>> > > > 0.9
>> > > > > and
>> > > > > >> > > 0.10?
>> > > > > >> > > > > > >
>> > > > > >> > > > > > >
>> > > > > >> > > > > > > That'd just be hooking up the CSV metrics reporter
>> > > right?
>> > > > > >> > > > > > >
>> > > > > >> > > > > > >
>> > > > > >> > > > > > > >
>> > > > > >> > > > > > > > 3. Is the broker-to-broker communication also on
>> > SSL?
>> > > > > Could
>> > > > > >> you
>> > > > > >> > > do
>> > > > > >> > > > > > > another
>> > > > > >> > > > > > > > test with replication factor 1 and see if you
>> still
>> > > see
>> > > > > the
>> > > > > >> > > > > > degradation?
>> > > > > >> > > > > > >
>> > > > > >> > > > > > >
>> > > > > >> > > > > > > Interbroker replication is always SSL in all test
>> runs
>> > > so
>> > > > > far.
>> > > > > >> I
>> > > > > >> > > can
>> > > > > >> > > > > try
>> > > > > >> > > > > > > with replication factor 1 tomorrow.
>> > > > > >> > > > > > >
>> > > > > >> > > > > > >
>> > > > > >> > > > > > > >
>> > > > > >> > > > > > > > Finally, email is probably not the best way to
>> > discuss
>> > > > > >> > > performance
>> > > > > >> > > > > > > results.
>> > > > > >> > > > > > > > If you have more of them, could you create a jira
>> > and
>> > > > > attach
>> > > > > >> > your
>> > > > > >> > > > > > > findings
>> > > > > >> > > > > > > > there?
>> > > > > >> > > > > > >
>> > > > > >> > > > > > >
>> > > > > >> > > > > > > Yep. I only wrote the email because JIRA was in
>> > lockdown
>> > > > > mode
>> > > > > >> > and I
>> > > > > >> > > > > > > couldn't create new issues.
>> > > > > >> > > > > > >
>> > > > > >> > > > > > > >
>> > > > > >> > > > > > > > Thanks,
>> > > > > >> > > > > > > >
>> > > > > >> > > > > > > > Jun
>> > > > > >> > > > > > > >
>> > > > > >> > > > > > > >
>> > > > > >> > > > > > > >
>> > > > > >> > > > > > > > On Thu, May 12, 2016 at 1:26 PM, Tom Crayford <
>> > > > > >> > > > tcrayford@heroku.com
>> > > > > >> > > > > > > > <javascript:;>> wrote:
>> > > > > >> > > > > > > >
>> > > > > >> > > > > > > > > We've started running our usual suite of
>> > performance
>> > > > > tests
>> > > > > >> > > > against
>> > > > > >> > > > > > > Kafka
>> > > > > >> > > > > > > > > 0.10.0.0 RC. These tests orchestrate multiple
>> > > > > >> > consumer/producer
>> > > > > >> > > > > > > machines
>> > > > > >> > > > > > > > to
>> > > > > >> > > > > > > > > run a fairly normal mixed workload of producers
>> > and
>> > > > > >> consumers
>> > > > > >> > > > (each
>> > > > > >> > > > > > > > > producer/consumer are just instances of kafka's
>> > > > inbuilt
>> > > > > >> > > > > > > consumer/producer
>> > > > > >> > > > > > > > > perf tests). We've found about a 33% performance
>> > > drop
>> > > > in
>> > > > > >> the
>> > > > > >> > > > > producer
>> > > > > >> > > > > > > if
>> > > > > >> > > > > > > > > TLS is used (compared to 0.9.0.1).
>> > > > > >> > > > > > > > >
>> > > > > >> > > > > > > > > We've seen notable producer performance
>> > > > > >> > > > > > > > > degradations between 0.9.0.1 and 0.10.0.0 RC. We're
>> > > > > >> > > > > > > > > running as of commit 9404680 right now.
>> > > > > >> > > > > > > > >
>> > > > > >> > > > > > > > > Our specific test case runs Kafka on 8 EC2
>> > machines,
>> > > > > with
>> > > > > >> > > > enhanced
>> > > > > >> > > > > > > > > networking. Nothing is changed between the
>> > > instances,
>> > > > > and
>> > > > > >> > I've
>> > > > > >> > > > > > > reproduced
>> > > > > >> > > > > > > > > this over 4 different sets of clusters now.
>> We're
>> > > > seeing
>> > > > > >> > about
>> > > > > >> > > a
>> > > > > >> > > > > 33%
>> > > > > >> > > > > > > > > performance drop between 0.9.0.1 and 0.10.0.0 as
>> > of
>> > > > > commit
>> > > > > >> > > > 9404680.
>> > > > > >> > > > > > > > Please
>> > > > > >> > > > > > > > > note that this doesn't match up with
>> > > > > >> > > > > > > > >
>> https://issues.apache.org/jira/browse/KAFKA-3565,
>> > > > > because
>> > > > > >> > our
>> > > > > >> > > > > > > > performance
>> > > > > >> > > > > > > > > tests are with compression off, and this seems to
>> > > > > >> > > > > > > > > be a TLS-only issue.
>> > > > > >> > > > > > > > >
>> > > > > >> > > > > > > > > Under 0.10.0-rc4, we see an 8 node cluster with
>> > > > > replication
>> > > > > >> > > > factor
>> > > > > >> > > > > of
>> > > > > >> > > > > > > 3,
>> > > > > >> > > > > > > > > and 13 producers max out at around 1 million 100
>> > > byte
>> > > > > >> > messages
>> > > > > >> > > a
>> > > > > >> > > > > > > second.
>> > > > > >> > > > > > > > > Under 0.9.0.1, the same cluster does 1.5 million
>> > > > > messages a
>> > > > > >> > > > second.
>> > > > > >> > > > > > > Both
>> > > > > >> > > > > > > > > tests were with TLS on. I've reproduced this on
>> > > > multiple
>> > > > > >> > > clusters
>> > > > > >> > > > > now
>> > > > > >> > > > > > > (5
>> > > > > >> > > > > > > > or
>> > > > > >> > > > > > > > > so of each version) to account for the inherent
>> > > > > performance
>> > > > > >> > > > > variance
>> > > > > >> > > > > > of
>> > > > > >> > > > > > > > > EC2. There's no notable performance difference
>> > > without
>> > > > > TLS
>> > > > > >> on
>> > > > > >> > > > these
>> > > > > >> > > > > > > runs
>> > > > > >> > > > > > > > -
>> > > > > >> > > > > > > > > it appears to be a TLS regression entirely.
>> > > > > >> > > > > > > > >
>> > > > > >> > > > > > > > > A single producer with TLS under 0.10 does about
>> > 75k
>> > > > > >> > > messages/s.
>> > > > > >> > > > > > Under
>> > > > > >> > > > > > > > > 0.9.0.1 it does around 120k messages/s.
>> > > > > >> > > > > > > > >
>> > > > > >> > > > > > > > > The exact producer-perf line we're using is
>> this:
>> > > > > >> > > > > > > > >
>> > > > > >> > > > > > > > > bin/kafka-producer-perf-test --topic "bench"
>> > > > > --num-records
>> > > > > >> > > > > > "500000000"
>> > > > > >> > > > > > > > > --record-size "100" --throughput "100"
>> > > > --producer-props
>> > > > > >> > > acks="-1"
>> > > > > >> > > > > > > > > bootstrap.servers=REDACTED
>> > > > > ssl.keystore.location=client.jks
>> > > > > >> > > > > > > > > ssl.keystore.password=REDACTED
>> > > > > >> > > ssl.truststore.location=server.jks
>> > > > > >> > > > > > > > > ssl.truststore.password=REDACTED
>> > > > > >> > > > > > > > > ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1
>> > > > > >> > > security.protocol=SSL
>> > > > > >> > > > > > > > >
>> > > > > >> > > > > > > > > We're using the same setup, machine type etc for
>> > > each
>> > > > > test
>> > > > > >> > run.
>> > > > > >> > > > > > > > >
>> > > > > >> > > > > > > > > We've tried using both 0.9.0.1 producers and
>> > > 0.10.0.0
>> > > > > >> > producers
>> > > > > >> > > > and
>> > > > > >> > > > > > the
>> > > > > >> > > > > > > > TLS
>> > > > > >> > > > > > > > > performance impact was there for both.
>> > > > > >> > > > > > > > >
>> > > > > >> > > > > > > > > I've glanced over the code between 0.9.0.1 and
>> > > > 0.10.0.0
>> > > > > and
>> > > > > >> > > > haven't
>> > > > > >> > > > > > > seen
>> > > > > >> > > > > > > > > anything that seemed to have this kind of
>> impact -
>> > > > > indeed
>> > > > > >> the
>> > > > > >> > > TLS
>> > > > > >> > > > > > code
>> > > > > >> > > > > > > > > doesn't seem to have changed much between
>> 0.9.0.1
>> > > and
>> > > > > >> > 0.10.0.0.
>> > > > > >> > > > > > > > >
>> > > > > >> > > > > > > > > Any thoughts? Should I file an issue and see
>> about
>> > > > > >> > reproducing
>> > > > > >> > > a
>> > > > > >> > > > > more
>> > > > > >> > > > > > > > > minimal test case?
>> > > > > >> > > > > > > > >
>> > > > > >> > > > > > > > > I don't think this is related to
>> > > > > >> > > > > > > > >
>> https://issues.apache.org/jira/browse/KAFKA-3565
>> > -
>> > > > > that is
>> > > > > >> > for
>> > > > > >> > > > > > > > compression
>> > > > > >> > > > > > > > > on and plaintext, and this is for TLS only.
>> > > > > >> > > > > > > > >
>> > > > > >> > > > > > > >
>> > > > > >> > > > > > >
>> > > > > >> > > > > >
>> > > > > >> > > > >
>> > > > > >> > > >
>> > > > > >> > >
>> > > > > >> >
>> > > > > >>
>> > > > >
>> > > >
>> > >
>> >
>>

Re: [VOTE] 0.10.0.0 RC4

Posted by Tom Crayford <tc...@heroku.com>.
https://github.com/apache/kafka/pull/1389

> > > > > >> > > > > > > > > ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1
> > > > > >> > > security.protocol=SSL
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > We're using the same setup, machine type etc for
> > > each
> > > > > test
> > > > > >> > run.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > We've tried using both 0.9.0.1 producers and
> > > 0.10.0.0
> > > > > >> > producers
> > > > > >> > > > and
> > > > > >> > > > > > the
> > > > > >> > > > > > > > TLS
> > > > > >> > > > > > > > > performance impact was there for both.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > I've glanced over the code between 0.9.0.1 and
> > > > 0.10.0.0
> > > > > and
> > > > > >> > > > haven't
> > > > > >> > > > > > > seen
> > > > > >> > > > > > > > > anything that seemed to have this kind of
> impact -
> > > > > indeed
> > > > > >> the
> > > > > >> > > TLS
> > > > > >> > > > > > code
> > > > > >> > > > > > > > > doesn't seem to have changed much between
> 0.9.0.1
> > > and
> > > > > >> > 0.10.0.0.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > Any thoughts? Should I file an issue and see
> about
> > > > > >> > reproducing
> > > > > >> > > a
> > > > > >> > > > > more
> > > > > >> > > > > > > > > minimal test case?
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > I don't think this is related to
> > > > > >> > > > > > > > >
> https://issues.apache.org/jira/browse/KAFKA-3565
> > -
> > > > > that is
> > > > > >> > for
> > > > > >> > > > > > > > compression
> > > > > >> > > > > > > > > on and plaintext, and this is for TLS only.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > >
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
>

Re: [VOTE] 0.10.0.0 RC4

Posted by Ismael Juma <is...@juma.me.uk>.
Hi Tom,

Great to hear that the failure testing scenario went well. :)

Your suggested improvement sounds good to me, and a PR would be great. For
this kind of change you can skip the JIRA; just prefix the PR title with
`MINOR:`.
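
For instance, a commit for that kind of change might look like the
following (the branch name and commit message here are made up for
illustration):

    # hypothetical example: any descriptive title works, as long as the
    # PR title carries the MINOR: prefix when it is opened
    git checkout -b minor-upgrade-note-tweaks
    git commit -am "MINOR: Clarify upgrade note about per-message timestamp overhead"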

Thanks,
Ismael

Re: [VOTE] 0.10.0.0 RC4

Posted by Tom Crayford <tc...@heroku.com>.
How about this?

    <b>Note:</b> Due to the additional timestamp introduced in each message
    (8 bytes of data), producers sending small messages may see a message
    throughput degradation because of the increased overhead. Likewise,
    replication now transmits an additional 8 bytes per message. If you're
    running close to the network capacity of your cluster, it's possible
    that you'll overwhelm the network cards and see failures and performance
    issues due to the overload. When receiving compressed messages, 0.10.0
    brokers avoid recompressing the messages, which in general reduces the
    latency and improves the throughput. In certain cases, this may reduce
    the batching size on the producer, which could lead to worse throughput.
    If this happens, users can tune linger.ms and batch.size of the producer
    for better throughput.
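
To make that last point concrete, here's a sketch of how one might bump
those two settings on the perf tool (the values are illustrative starting
points, not recommendations):

    # linger.ms=10 gives batches up to 10ms to fill before sending;
    # batch.size=65536 raises the per-partition batch cap
    bin/kafka-producer-perf-test --topic bench --num-records 50000000 \
      --record-size 100 --throughput -1 --producer-props acks=-1 \
      bootstrap.servers=REDACTED linger.ms=10 batch.size=65536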

Would you like a Jira/PR with this kind of change so we can discuss it in
a more convenient format?

Re our failure testing scenario: Kafka 0.10 RC behaves exactly the same
under failure as 0.9 - the controller typically shifts the leader in around
2 seconds or so, and the benchmark sees a small drop in throughput during
that, then another drop whilst the replacement broker comes back up to
speed. So, overall we're extremely happy and excited for this release!
Thanks to the committers and maintainers for all their hard work.
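
For anyone who wants to run the same kind of failure test, the shape of it
is roughly this (a sketch: it assumes a single Kafka JVM per host, the
standard distribution's script paths, and a placeholder Zookeeper address):

    # hard-kill the broker process on one node
    kill -9 "$(pgrep -f kafka.Kafka)"
    # watch leadership settle: under-replicated partitions should drain
    # back to zero once the replacement broker catches up
    bin/kafka-topics.sh --zookeeper REDACTED:2181 --describe \
      --under-replicated-partitions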

Re: [VOTE] 0.10.0.0 RC4

Posted by Ismael Juma <is...@juma.me.uk>.
Hi Tom,

Thanks for the update and for all the testing you have done! No worries
about the chase here; I'd much rather have false positives from people who
are validating the releases than false negatives because people don't
validate the releases. :)

The upgrade note we currently have follows:

https://github.com/apache/kafka/blob/0.10.0/docs/upgrade.html#L67

Please feel free to suggest improvements.

Thanks,
Ismael

Re: [VOTE] 0.10.0.0 RC4

Posted by Tom Crayford <tc...@heroku.com>.
I've been digging into this some more. It seems this may have been an
issue with the benchmarks maxing out the network cards - under 0.10.0.0-RC
the slightly higher per-message bandwidth seems to have pushed the brokers'
NICs into overload territory where they start dropping packets (verified
with ifconfig on each broker). This leaves the brokers unable to talk to
ZooKeeper properly, which leads to OfflinePartitions, which in turn
undermines the benchmark's validity, as throughput drops a lot when brokers
are flapping in and out of being online. Because 0.9.0.1 sends 8 bytes less
per message, the brokers' NICs can sustain more messages/s. There was an
"alignment" issue with the benchmarks here - under 0.9 we were *just* at
the limit of what the brokers' NICs could sustain, and under 0.10 we pushed
over it (at 1.5 million messages/s, 8 bytes extra per message is an extra
36 MB/s with replication factor 3 [if my math is right, and that's before
SSL encryption, which may add further overhead], which is as much as an
additional producer machine).
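
(A runnable version of that back-of-the-envelope math - a sketch, not a
precise model, and it assumes replica fetch traffic crosses the same NICs:)

    # extra NIC load from the 8 bytes/message the 0.10 format adds
    msgs_per_sec=1500000
    extra_bytes_per_msg=8
    replication_factor=3
    echo "$(( msgs_per_sec * extra_bytes_per_msg * replication_factor / 1000000 )) MB/s extra"
    # prints: 36 MB/s extra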

The dropped packets and the flapping weren't causing notable timeout issues
in the producer, but looking at the metrics on the brokers, offline
partitions were clearly being triggered repeatedly, and the broker logs
show ZK session timeouts. This is consistent with earlier benchmarking
experience - the number of producers we were running under 0.9.0.1 was
carefully selected to be just under this limit.
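
(For anyone who wants to check for the same symptom: a minimal sketch of
the drop check I ran, assuming a Linux broker with Kafka traffic on eth0 -
adjust the interface name for your hosts:)

    # sample the per-interface counters twice under load; if the "drop"
    # columns grow between samples, the NIC is shedding packets
    grep 'eth0' /proc/net/dev
    sleep 10
    grep 'eth0' /proc/net/dev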

The other benchmark issue, where I reported a difference between two
single producers, was caused by a performance problem on the producer
machine itself that I wasn't properly aware of. Apologies there.

I've now done benchmarks where I limit the producer throughput (via
--throughput) to slightly below what the NICs can sustain, and have seen no
notable performance or stability difference between 0.10 and 0.9.0.1 as
long as you stay under the limits of the network interfaces. All of the
clusters I have tested happily sustain a benchmark at this rate for 6 hours
under both 0.9.0.1 and 0.10.0.0. I've also verified that our clusters are
entirely network bound in these producer benchmarking scenarios - the disks
and CPU/memory have plenty of remaining capacity.
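
(For concreteness, a sketch of one capped run, using the same flags as the
command earlier in this thread; the 70000 msgs/sec per-producer cap is an
example value picked so that 13 producers stay just under our NIC limit:)

    bin/kafka-producer-perf-test --topic "bench" --num-records "500000000" \
      --record-size "100" --throughput "70000" --producer-props acks="-1" \
      bootstrap.servers=REDACTED security.protocol=SSL \
      ssl.keystore.location=client.jks ssl.keystore.password=REDACTED \
      ssl.truststore.location=server.jks ssl.truststore.password=REDACTED \
      ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1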

This was pretty hard to verify fully, which is why I've taken so long to
reply. All in all I think the result here is expected and not a blocker for
release, but a good thing to note on upgrades - if folk are running at the
limit of their network cards (which you never want to do anyway, but
benchmarking scenarios often uncover those limits), they'll see issues due
to increased replication and producer traffic under 0.10.0.0.

Apologies for the chase here - this distinctly seemed like a real issue and
one I (and I think everybody else) would have wanted to block the release
on. I'm going to move onto our "failure" testing, in which we run the same
performance benchmarks whilst causing a hard kill on the node. We've seen
very good results for that under 0.9 and hopefully they'll continue under
0.10.

On Sat, May 14, 2016 at 1:33 AM, Gwen Shapira <gw...@confluent.io> wrote:

> also, perhaps sharing the broker configuration? maybe this will
> provide some hints...

Re: [VOTE] 0.10.0.0 RC4

Posted by Gwen Shapira <gw...@confluent.io>.
also, perhaps sharing the broker configuration? maybe this will
provide some hints...
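
(Concretely, the listener/TLS lines from server.properties would be the
most useful - a hypothetical sketch of the keys worth sharing, with
example values only:)

    listeners=SSL://0.0.0.0:9093
    security.inter.broker.protocol=SSL
    ssl.keystore.location=/etc/kafka/server.jks
    ssl.keystore.password=REDACTED
    ssl.truststore.location=/etc/kafka/truststore.jks
    ssl.truststore.password=REDACTED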

On Fri, May 13, 2016 at 5:31 PM, Ismael Juma <is...@juma.me.uk> wrote:
> Thanks Tom. I just wanted to share that I have been unable to reproduce
> this so far. Please feel free to share whatever information you have so
> far when you have a chance, don't feel that you need to have all the
> answers.
>
> Ismael

Re: [VOTE] 0.10.0.0 RC4

Posted by Ismael Juma <is...@juma.me.uk>.
Thanks Tom. I just wanted to share that I have been unable to reproduce
this so far. Please feel free to share whatever information you have so
far when you have a chance, don't feel that you need to have all the
answers.

Ismael

On Fri, May 13, 2016 at 7:32 PM, Tom Crayford <tc...@heroku.com> wrote:

> I've been investigating this pretty hard since I first noticed it. Right
> now I have more avenues for investigation than I can shake a stick at, and
> am also dealing with several other things in flight/on fire. I'll respond
> when I have more information and can confirm things.

Re: [VOTE] 0.10.0.0 RC4

Posted by Gwen Shapira <gw...@confluent.io>.
Hi,

We (Ismael, Magnus and Jun) are also trying to reproduce and figure it
out on our side. Will keep you posted.

Gwen

On Fri, May 13, 2016 at 11:32 AM, Tom Crayford <tc...@heroku.com> wrote:
> I've been investigating this pretty hard since I first noticed it. Right
> now I have more avenues for investigation than I can shake a stick at, and
> am also dealing with several other things in flight/on fire. I'll respond
> when I have more information and can confirm things.

Re: [VOTE] 0.10.0.0 RC4

Posted by Tom Crayford <tc...@heroku.com>.
I've been investigating this pretty hard since I first noticed it. Right
now I have more avenues for investigation than I can shake a stick at, and
am also dealing with several other things in flight/on fire. I'll respond
when I have more information and can confirm things.

On Fri, May 13, 2016 at 6:30 PM, Becket Qin <be...@gmail.com> wrote:

> Tom,
>
> Maybe it is mentioned and I missed it. I am wondering if you see performance
> degradation on the consumer side when TLS is used? This could help us
> understand whether the issue is only producer related or TLS in general.
>
> Thanks,
>
> Jiangjie (Becket) Qin

Re: [VOTE] 0.10.0.0 RC4

Posted by Becket Qin <be...@gmail.com>.
Gwen,

The version we are currently running in production is trunk as of Feb 24,
which includes KAFKA-3025.

Our release test cluster has been running this version for about two
months, and I haven't seen throughput issues so far. But we are probably
not running at the max capacity of the brokers. I will set up a throughput
test and see if I can reproduce this issue.
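
(For anyone wanting to confirm whether a given build includes that fix, a
quick sketch against a kafka.git checkout; the branch name and placeholder
sha are examples:)

    # find the KAFKA-3025 commit(s), then check ancestry against a branch
    git log --oneline --all --grep='KAFKA-3025'
    git merge-base --is-ancestor <sha-from-above> trunk && echo "fix included"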

Thanks,

Jiangjie (Becket) Qin


On Fri, May 13, 2016 at 11:41 AM, Gwen Shapira <gw...@confluent.io> wrote:

> Becket,
>
> Did you try deploying one of the 0.10.0 candidates at LinkedIn? Did
> you see this issue?
>
> Gwen

Re: [VOTE] 0.10.0.0 RC4

Posted by Gwen Shapira <gw...@confluent.io>.
Becket,

Did you try deploying one of the 0.10.0 candidates at LinkedIn? Did
you see this issue?

Gwen

On Fri, May 13, 2016 at 10:30 AM, Becket Qin <be...@gmail.com> wrote:
> Tom,
>
> Maybe it is mentioned and I missed. I am wondering if you see performance
> degradation on the consumer side when TLS is used? This could help us
> understand whether the issue is only producer related or TLS in general.
>
> Thanks,
>
> Jiangjie (Becket) Qin

Re: [VOTE] 0.10.0.0 RC4

Posted by Becket Qin <be...@gmail.com>.
Tom,

Maybe it is mentioned and I missed. I am wondering if you see performance
degradation on the consumer side when TLS is used? This could help us
understand whether the issue is only producer related or TLS in general.
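Something like the built-in consumer perf test over TLS would answer that
(a sketch only - the exact flags should be checked against
bin/kafka-consumer-perf-test --help in 0.10, and consumer.properties here
would carry the same ssl.* settings as the producer runs):

bin/kafka-consumer-perf-test --new-consumer --broker-list REDACTED:9093 \
  --topic bench --messages 50000000 --consumer.config consumer.properties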

Thanks,

Jiangjie (Becket) Qin

On Fri, May 13, 2016 at 6:19 AM, Tom Crayford <tc...@heroku.com> wrote:

> Ismael,
>
> Thanks. I'm writing up an issue with some new findings since yesterday
> right now.
>
> Thanks
>
> Tom

Re: [VOTE] 0.10.0.0 RC4

Posted by Tom Crayford <tc...@heroku.com>.
Ismael,

Thanks. I'm writing up an issue with some new findings since yesterday
right now.

Thanks

Tom

On Fri, May 13, 2016 at 1:06 PM, Ismael Juma <is...@juma.me.uk> wrote:

> Hi Tom,
>
> That's because JIRA is in lockdown due to excessive spam. I have added you
> as a contributor in JIRA and you should be able to file a ticket now.
>
> Thanks,
> Ismael

Re: [VOTE] 0.10.0.0 RC4

Posted by Ismael Juma <is...@juma.me.uk>.
Hi Tom,

That's because JIRA is in lockdown due to excessive spam. I have added you
as a contributor in JIRA and you should be able to file a ticket now.

Thanks,
Ismael

On Fri, May 13, 2016 at 12:17 PM, Tom Crayford <tc...@heroku.com> wrote:

> Ok, I don't seem to be able to file a new Jira issue at all. Can somebody
> check my permissions on Jira? My user is `tcrayford-heroku`
>
> Tom Crayford
> Heroku Kafka

Re: [VOTE] 0.10.0.0 RC4

Posted by Tom Crayford <tc...@heroku.com>.
Ok, I don't seem to be able to file a new Jira issue at all. Can somebody
check my permissions on Jira? My user is `tcrayford-heroku`

Tom Crayford
Heroku Kafka

On Fri, May 13, 2016 at 12:24 AM, Jun Rao <ju...@confluent.io> wrote:

> Tom,
>
> We don't have a CSV metrics reporter in the producer right now. The metrics
> will be available in jmx. You can find out the details in
> http://kafka.apache.org/documentation.html#new_producer_monitoring
>
> Thanks,
>
> Jun

Re: [VOTE] 0.10.0.0 RC4

Posted by Jun Rao <ju...@confluent.io>.
Tom,

We don't have a CSV metrics reporter in the producer right now. The metrics
will be available in jmx. You can find out the details in
http://kafka.apache.org/documentation.html#new_producer_monitoring
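For the batch size question, something like kafka.tools.JmxTool can poll
the relevant attributes from the command line (a sketch - it assumes the
perf-test JVM was started with JMX_PORT=9990 exported, and the default
client-id of the first producer in a JVM, producer-1):

bin/kafka-run-class.sh kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9990/jmxrmi \
  --object-name kafka.producer:type=producer-metrics,client-id=producer-1 \
  --attributes batch-size-avg,record-send-rate \
  --reporting-interval 5000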

Thanks,

Jun

On Thu, May 12, 2016 at 3:08 PM, Tom Crayford <tc...@heroku.com> wrote:

> Yep, I can try those particular commits tomorrow. Before I try a bisect,
> I'm going to replicate with a less intensive, smaller-scale perf test
> that's easier to iterate on.

Re: [VOTE] 0.10.0.0 RC4

Posted by Tom Crayford <tc...@heroku.com>.
Yep, I can try those particular commits tomorrow. Before I try a bisect,
I'm going to replicate with a less intensive, smaller-scale perf test
that's easier to iterate on.
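The bisect itself would be the usual routine, roughly:

git bisect start 0.10.0.0-rc4 <known-good-sha>   # bad first, then good
# at each step: rebuild (./gradlew clean releaseTarGz), redeploy the
# brokers, rerun the TLS producer workload, then mark:
git bisect good   # or: git bisect bad

(<known-good-sha> being a placeholder for the newest commit that still
showed 0.9-level numbers.)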

Jun, inline:

On Thursday, 12 May 2016, Jun Rao <ju...@confluent.io> wrote:

> Tom,
>
> Thanks for reporting this. A few quick comments.
>
> 1. Did you send the right command for producer-perf? The command limits the
> throughput to 100 msgs/sec. So, not sure how a single producer can get 75K
> msgs/sec.


Ah yep, wrong commands. I'll get the right one tomorrow. Sorry, was
interpolating variables into a shell script.


>
> 2. Could you collect some stats (e.g. average batch size) in the producer
> and see if there is any noticeable difference between 0.9 and 0.10?


That'd just be hooking up the CSV metrics reporter right?


>
> 3. Is the broker-to-broker communication also on SSL? Could you do another
> test with replication factor 1 and see if you still see the degradation?


Interbroker replication is always SSL in all test runs so far. I can try
with replication factor 1 tomorrow.
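For reference, the brokers are configured along these lines (paths and
passwords here are illustrative, not our real values):

listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=SSL
ssl.keystore.location=/etc/kafka/server.jks
ssl.keystore.password=REDACTED
ssl.key.password=REDACTED
ssl.truststore.location=/etc/kafka/truststore.jks
ssl.truststore.password=REDACTED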


>
> Finally, email is probably not the best way to discuss performance results.
> If you have more of them, could you create a jira and attach your findings
> there?


Yep. I only wrote the email because JIRA was in lockdown mode and I
couldn't create new issues.

>
> Thanks,
>
> Jun

Re: [VOTE] 0.10.0.0 RC4

Posted by Jun Rao <ju...@confluent.io>.
Tom,

Thanks for reporting this. A few quick comments.

1. Did you send the right command for producer-perf? The command limits the
throughput to 100 msgs/sec. So, not sure how a single producer can get 75K
msgs/sec.

2. Could you collect some stats (e.g. average batch size) in the producer
and see if there is any noticeable difference between 0.9 and 0.10?

3. Is the broker-to-broker communication also on SSL? Could you do another
test with replication factor 1 and see if you still see the degradation?
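For instance, the topic could be recreated with replication factor 1 along
these lines (the partition count here is just a placeholder):

bin/kafka-topics.sh --create --zookeeper REDACTED:2181 --topic bench-rf1 \
  --partitions 8 --replication-factor 1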

Finally, email is probably not the best way to discuss performance results.
If you have more of them, could you create a jira and attach your findings
there?

Thanks,

Jun



On Thu, May 12, 2016 at 1:26 PM, Tom Crayford <tc...@heroku.com> wrote:

> We've started running our usual suite of performance tests against Kafka
> 0.10.0.0 RC. These tests orchestrate multiple consumer/producer machines to
> run a fairly normal mixed workload of producers and consumers (each
> producer/consumer are just instances of kafka's inbuilt consumer/producer
> perf tests). We've found about a 33% performance drop in the producer if
> TLS is used (compared to 0.9.0.1)

Re: [VOTE] 0.10.0.0 RC4

Posted by Gwen Shapira <gw...@confluent.io>.
Just to confirm:
You tested both versions with plain text and saw no performance drop?


On Thu, May 12, 2016 at 1:26 PM, Tom Crayford <tc...@heroku.com> wrote:
> We've started running our usual suite of performance tests against Kafka
> 0.10.0.0 RC. These tests orchestrate multiple consumer/producer machines to
> run a fairly normal mixed workload of producers and consumers (each
> producer/consumer are just instances of kafka's inbuilt consumer/producer
> perf tests). We've found about a 33% performance drop in the producer if
> TLS is used (compared to 0.9.0.1)

Re: [VOTE] 0.10.0.0 RC4

Posted by Tom Crayford <tc...@heroku.com>.
We've started running our usual suite of performance tests against Kafka
0.10.0.0 RC. These tests orchestrate multiple consumer/producer machines to
run a fairly normal mixed workload of producers and consumers (each
producer/consumer is just an instance of kafka's inbuilt consumer/producer
perf tests). We've found about a 33% performance drop in the producer if
TLS is used (compared to 0.9.0.1).

We've seen notable producer performance degradations between 0.9.0.1 and
0.10.0.0 RC. We're running as of the commit 9404680 right now.

Our specific test case runs Kafka on 8 EC2 machines, with enhanced
networking. Nothing is changed between the instances, and I've reproduced
this over 4 different sets of clusters now. We're seeing about a 33%
performance drop between 0.9.0.1 and 0.10.0.0 as of commit 9404680. Please
note that this doesn't match up with
https://issues.apache.org/jira/browse/KAFKA-3565, because our performance
tests are with compression off, and this seems to be a TLS-only issue.

Under 0.10.0-rc4, we see an 8 node cluster with replication factor of 3,
and 13 producers max out at around 1 million 100 byte messages a second.
Under 0.9.0.1, the same cluster does 1.5 million messages a second. Both
tests were with TLS on. I've reproduced this on multiple clusters now (5 or
so of each version) to account for the inherent performance variance of
EC2. There's no notable performance difference without TLS on these runs -
it appears to be a TLS regression entirely.

A single producer with TLS under 0.10 does about 75k messages/s. Under
0.9.0.1 it does around 120k messages/s.

The exact producer-perf line we're using is this:

bin/kafka-producer-perf-test --topic "bench" --num-records "500000000"
--record-size "100" --throughput "100" --producer-props acks="-1"
bootstrap.servers=REDACTED ssl.keystore.location=client.jks
ssl.keystore.password=REDACTED ssl.truststore.location=server.jks
ssl.truststore.password=REDACTED
ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1 security.protocol=SSL
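(The --throughput "100" there is a paste error from interpolating variables
into a shell script - the unthrottled runs would have looked more like the
following, with a negative value to disable the throttle per the tool's
help:)

bin/kafka-producer-perf-test --topic "bench" --num-records "500000000"
--record-size "100" --throughput "-1" --producer-props acks="-1"
bootstrap.servers=REDACTED ssl.keystore.location=client.jks
ssl.keystore.password=REDACTED ssl.truststore.location=server.jks
ssl.truststore.password=REDACTED
ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1 security.protocol=SSL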

We're using the same setup, machine type etc for each test run.

We've tried using both 0.9.0.1 producers and 0.10.0.0 producers and the TLS
performance impact was there for both.

I've glanced over the code between 0.9.0.1 and 0.10.0.0 and haven't seen
anything that seemed to have this kind of impact - indeed the TLS code
doesn't seem to have changed much between 0.9.0.1 and 0.10.0.0.

Any thoughts? Should I file an issue and see about reproducing a more
minimal test case?

I don't think this is related to
https://issues.apache.org/jira/browse/KAFKA-3565 - that is for compression
on and plaintext, and this is for TLS only.

Re: [VOTE] 0.10.0.0 RC4

Posted by Gwen Shapira <gw...@confluent.io>.
Thanks for finding and reporting, Liquan. I'll wait a day or two for
more testing and roll out a new RC.

In other news:
We keep running into last-minute issues with our shell scripts
because we have zero automated testing for them.
Contributions of automated tests for our scripts would be super helpful!
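
As a concrete starting point, a smoke test for the classpath handling in
kafka-run-class.sh could be as small as the sketch below. It never actually
runs Kafka - it only inspects the command line the script builds, via bash -x
tracing - and the jar path and class name are made up for illustration. A
test along these lines would likely have caught KAFKA-3692:

#!/usr/bin/env bash
# Smoke-test sketch: assert that a jar supplied via the external CLASSPATH
# variable survives into the java invocation kafka-run-class.sh constructs.
# The class is bogus on purpose; we only need the traced command line.
set -uo pipefail

extra_jar=/tmp/classpath-smoke-test.jar
touch "$extra_jar"

trace=$(CLASSPATH="$extra_jar" bash -x bin/kafka-run-class.sh \
        some.bogus.TestClass 2>&1 || true)

if grep -q "$extra_jar" <<< "$trace"; then
  echo "PASS: external CLASSPATH reached the java command line"
else
  echo "FAIL: external CLASSPATH was dropped"
  exit 1
fi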

Gwen

On Tue, May 10, 2016 at 7:43 PM, Liquan Pei <li...@gmail.com> wrote:
> We found a blocking issue on the release:
> https://issues.apache.org/jira/browse/KAFKA-3692. This may cause the
> external CLASSPATH not to be included in the final CLASSPATH in
> kafka-run-class.sh. There is no easy workaround for this and we need a new
> RC.
>
> Thanks,
> Liquan
>
> On Mon, May 9, 2016 at 6:49 PM, Gwen Shapira <gw...@confluent.io> wrote:
>
>> Hello Kafka users, developers and client-developers,
>>
>> This is the first candidate for release of Apache Kafka 0.10.0.0. This
>> is a major release that includes: (1) New message format including
>> timestamps (2) client interceptor API (3) Kafka Streams. Since this is
>> a major release, we will give people more time to try it out and give
>> feedback.
>>
>> We fixed a bunch of bugs since the previous release candidate, and we are
>> done with the blockers.
>> So let's test and vote!
>>
>> Release notes for the 0.10.0.0 release:
>> http://home.apache.org/~gwenshap/0.10.0.0-rc4/RELEASE_NOTES.html
>>
>> *** Please download, test and vote by Monday, May 16, 9am PT
>>
>> Kafka's KEYS file containing PGP keys we use to sign the release:
>> http://kafka.apache.org/KEYS
>>
>> * Release artifacts to be voted upon (source and binary):
>> http://home.apache.org/~gwenshap/0.10.0.0-rc4/
>>
>> * Maven artifacts to be voted upon:
>> https://repository.apache.org/content/groups/staging/
>>
>> * scala-doc
>> http://home.apache.org/~gwenshap/0.10.0.0-rc4/scaladoc
>>
>> * java-doc
>> http://home.apache.org/~gwenshap/0.10.0.0-rc4/javadoc/
>>
>> * tag to be voted upon (off 0.10.0 branch) is the 0.10.0.0-rc4 tag:
>>
>> https://git-wip-us.apache.org/repos/asf?p=kafka.git;a=tag;h=31cf73c737d79b09b62acce461c7c22461ff5ad8
>>
>> * Documentation:
>> http://kafka.apache.org/0100/documentation.html
>>
>> * Protocol:
>> http://kafka.apache.org/0100/protocol.html
>>
>> /**************************************
>>
>> Thanks,
>>
>> Gwen
>>
>
>
>
> --
> Liquan Pei
> Software Engineer, Confluent Inc

Re: [VOTE] 0.10.0.0 RC4

Posted by Liquan Pei <li...@gmail.com>.
We found a blocking issue on the release:
https://issues.apache.org/jira/browse/KAFKA-3692. This may cause the
external CLASSPATH not to be included in the final CLASSPATH in
kafka-run-class.sh. There is no easy workaround for this and we need a new
RC.
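
One quick way to check whether a given build is affected - a sketch, with an
illustrative jar path - is to trace what the script actually execs and look
for the external jar:

# Quick check: does a jar passed via the external CLASSPATH variable show up
# in the java command kafka-run-class.sh builds? Zero matches means it was
# dropped.
touch /tmp/extra.jar
CLASSPATH=/tmp/extra.jar bash -x bin/kafka-run-class.sh kafka.Kafka 2>&1 \
  | grep -c /tmp/extra.jar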

Thanks,
Liquan

On Mon, May 9, 2016 at 6:49 PM, Gwen Shapira <gw...@confluent.io> wrote:

> Hello Kafka users, developers and client-developers,
>
> This is the first candidate for release of Apache Kafka 0.10.0.0. This
> is a major release that includes: (1) New message format including
> timestamps (2) client interceptor API (3) Kafka Streams. Since this is
> a major release, we will give people more time to try it out and give
> feedback.
>
> We fixed a bunch of bugs since the previous release candidate, and we are
> done with the blockers.
> So let's test and vote!
>
> Release notes for the 0.10.0.0 release:
> http://home.apache.org/~gwenshap/0.10.0.0-rc4/RELEASE_NOTES.html
>
> *** Please download, test and vote by Monday, May 16, 9am PT
>
> Kafka's KEYS file containing PGP keys we use to sign the release:
> http://kafka.apache.org/KEYS
>
> * Release artifacts to be voted upon (source and binary):
> http://home.apache.org/~gwenshap/0.10.0.0-rc4/
>
> * Maven artifacts to be voted upon:
> https://repository.apache.org/content/groups/staging/
>
> * scala-doc
> http://home.apache.org/~gwenshap/0.10.0.0-rc4/scaladoc
>
> * java-doc
> http://home.apache.org/~gwenshap/0.10.0.0-rc4/javadoc/
>
> * tag to be voted upon (off 0.10.0 branch) is the 0.10.0.0-rc4 tag:
>
> https://git-wip-us.apache.org/repos/asf?p=kafka.git;a=tag;h=31cf73c737d79b09b62acce461c7c22461ff5ad8
>
> * Documentation:
> http://kafka.apache.org/0100/documentation.html
>
> * Protocol:
> http://kafka.apache.org/0100/protocol.html
>
> /**************************************
>
> Thanks,
>
> Gwen
>



-- 
Liquan Pei
Software Engineer, Confluent Inc
