
Posted to user@storm.apache.org by Michael Noll <mi...@confluent.io> on 2017/01/09 12:48:07 UTC

Re: Storm 1.0.2 - KafkaBolt throughput skews

Dominik,

your setup looks to be:

Test data producer* -> Kafka cluster -> KafkaSpout [Storm] -> KafkaBolt

(*You didn't say exactly what you used for this, but presumably something
based on Kafka's producer client.)

Here, the KafkaSpout internally uses Kafka's consumer client to read data
from Kafka (i.e. your 100M test messages), and then uses Storm's internal
messaging layer (based on Netty) to forward this data to the KafkaBolt.

One reason you see a discrepancy between the fast "native" Kafka
performance (e.g. your 650k msg/s for the KafkaProducer) and the slower
Storm performance (at most 206k msg/s for the KafkaBolt) is the difference
between the two messaging layers, i.e. Kafka's messaging layer vs. Storm's
internal, Netty-based one.  Kafka is simply much more performant on that
front: the project originally started out as a messaging system, so its
messaging layer is really good.  Other factors may include the
implementations/versions of Storm, the KafkaSpout, etc., any
serialization/deserialization you're doing, configuration settings (you
mentioned you kept the defaults), and so on.
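
For a sense of scale, the figures quoted in this thread can be put side by side.  The sketch below is plain arithmetic on those numbers (100M messages of 100 bytes, 650k msg/s standalone, 206k msg/s peak in the KafkaBolt); the class and method names are made up for illustration.  Since 206k msg/s is a peak value and some seconds reportedly show zero throughput, the time computed for the KafkaBolt is a lower bound, not a measurement.

```java
import java.util.Locale;

final class ThroughputComparison {
    // Figures taken from the thread.
    static final long MESSAGE_SIZE_BYTES = 100L;
    static final long TOTAL_MESSAGES = 100_000_000L;
    static final long PRODUCER_MSGS_PER_SEC = 650_000L;   // standalone KafkaProducer
    static final long BOLT_PEAK_MSGS_PER_SEC = 206_000L;  // KafkaBolt, "at most"

    // Payload volume in MB/s (1 MB = 1,000,000 bytes) at a given message rate.
    static double megabytesPerSecond(long messagesPerSecond) {
        return messagesPerSecond * MESSAGE_SIZE_BYTES / 1_000_000.0;
    }

    // Seconds needed to push all test messages at a constant message rate.
    static double secondsToProcessAll(long messagesPerSecond) {
        return (double) TOTAL_MESSAGES / messagesPerSecond;
    }

    // How many times faster the standalone producer is than the bolt's peak.
    static double producerToBoltRatio() {
        return (double) PRODUCER_MSGS_PER_SEC / BOLT_PEAK_MSGS_PER_SEC;
    }

    public static void main(String[] args) {
        System.out.printf(Locale.ROOT, "KafkaProducer: %.1f MB/s, %.0f s for all messages%n",
                megabytesPerSecond(PRODUCER_MSGS_PER_SEC),
                secondsToProcessAll(PRODUCER_MSGS_PER_SEC));
        System.out.printf(Locale.ROOT, "KafkaBolt (peak): %.1f MB/s, at least %.0f s for all messages%n",
                megabytesPerSecond(BOLT_PEAK_MSGS_PER_SEC),
                secondsToProcessAll(BOLT_PEAK_MSGS_PER_SEC));
        System.out.printf(Locale.ROOT, "Ratio: %.2fx%n", producerToBoltRatio());
    }
}
```

In payload terms both rates are modest (tens of MB/s), so the gap is more likely down to per-message overhead than to raw network bandwidth; that reading is my assumption, not something measured in this thread.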

It would help though if you shared the exact versions of Storm and Kafka
that you have been using for your experiments.  Historically, the Kafka
integration (KafkaSpout) was only so-so; it has improved since, although
-- to my knowledge -- even the latest versions still trail native Kafka
performance.

-Michael

On Thu, Sep 29, 2016 at 8:36 PM, Dominik Safaric <do...@gmail.com>
wrote:

> Hi Everyone,
>
> In the past few days, I’ve been benchmarking Storm using a simple topology
> consisting of a KafkaSpout and a KafkaBolt. For the benchmark, I’ve produced
> 100.000.000 messages into Kafka, each 100 bytes in size. The configuration
> of Kafka, ZooKeeper and Storm was intentionally left at its defaults.
>
> An interesting observation I’ve made concerns the KafkaBolt
> throughput. Namely, when running the KafkaProducer standalone, it has a
> uniform throughput of approximately 650.000 messages per second, whereas
> in the case of the KafkaBolt the throughput is at most 206.000 messages
> per second, with a skewed distribution in which consecutive seconds may
> have *zero throughput*, i.e. no tuples emitted. For an overview of the
> distribution while running the benchmark on a cluster, take a look at the
> graph below.
>
> Now, my question is: why does the KafkaBolt have such a decreased
> throughput compared to a standalone KafkaProducer? What factors, in
> your experience, influence its throughput?
>
> I’ve repeated the measurement with various configuration changes, such
> as setting topology.executor.(receive | send).buffer.size and
> disabling acknowledgements. The result, although improved in some cases,
> still shows a skewed throughput throughout the benchmark.
>
> Thanks in advance for sharing your experience and advice!
>
> Dominik
>
>
>

Re: Storm 1.0.2 - KafkaBolt throughput skews

Posted by Michael Noll <mi...@confluent.io>.

PS: With "configuration settings" I was also referring to e.g. the settings
of the Kafka producer, which can be configured to batch messages and send
them asynchronously, for example.  This will of course yield much higher
throughput than configuring the producer to (say) send messages
synchronously one at a time.  So, even though you said you intentionally
kept the Storm etc. settings at their defaults, I'd still check the
producer settings to see whether they match between your two setups.
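
To make the batching point concrete, here is a minimal sketch of two producer configurations built from the standard Kafka producer keys `acks`, `linger.ms` and `batch.size`.  The class name, method names and the concrete values are illustrative assumptions, not settings taken from the benchmark.  Note that whether a send is synchronous is decided in code (blocking on the future returned by `send()`), not by a property; the second configuration only removes batching.

```java
import java.util.Properties;

final class ProducerSettingsSketch {
    // Keys shared by both variants; String serializers are an assumption.
    private static Properties base(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        return props;
    }

    // Throughput-oriented: wait briefly so records can be grouped into larger batches.
    static Properties batched(String bootstrapServers) {
        Properties props = base(bootstrapServers);
        props.put("acks", "1");            // leader acknowledgement only
        props.put("linger.ms", "5");       // illustrative value
        props.put("batch.size", "65536");  // illustrative value, in bytes
        return props;
    }

    // Latency/safety-oriented: no lingering, batching disabled, full acknowledgement.
    static Properties unbatched(String bootstrapServers) {
        Properties props = base(bootstrapServers);
        props.put("acks", "all");
        props.put("linger.ms", "0");
        props.put("batch.size", "0");      // a batch size of zero disables batching
        return props;
    }
}
```

As far as I know, the KafkaBolt in storm-kafka 1.0.x accepts such a `Properties` object via `withProducerProperties(...)`, so the same object can be handed to the standalone KafkaProducer and to the bolt to make the comparison like-for-like.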


