Posted to dev@kafka.apache.org by Ben Stopford <be...@confluent.io> on 2015/09/01 21:28:17 UTC

Re: ProducerPerformance.scala compressing Array of Zeros

You’re absolutely right. This should be fixed. I’ve made a note of this in https://issues.apache.org/jira/browse/KAFKA-2499.

If you’d like to submit a pull request for this that would be awesome :) 

Otherwise I’ll try and fit it into the other performance stuff I’m looking at. 

Ben


> On 31 Aug 2015, at 12:22, Prabhjot Bharaj <pr...@gmail.com> wrote:
> 
> Hello Folks,
> 
> I was going through ProducerPerformance.scala.
> 
> Having a close look at line no. 247 in 'def generateProducerData' in
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/tools/ProducerPerformance.scala,
> the message that the producer sends to Kafka is an array of 0s.
> 
> A basic understanding of compression algorithms suggests that highly
> repetitive data compresses best.
> 
> 
> I have also observed that when compressing an array of zero bytes, the
> throughput increases significantly when I use lz4 or snappy vs
> NoCompressionCodec. But this is largely dependent on the nature of the data.
> 
> 
> Is this what we are trying to test here?
> Or should ProducerPerformance.scala create an array of random bytes
> (instead of just zeroes)?
> 
> If this can be improved, shall I open an issue to track it?
> 
> Regards,
> Prabhjot
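The compressibility gap described above is easy to reproduce outside Kafka with the JDK's own Deflater. This is a self-contained sketch, not Kafka code; the zero-filled array stands in for the payload that ProducerPerformance.scala generates today, and the random array for the proposed alternative:

```java
import java.util.Random;
import java.util.zip.Deflater;

public class PayloadCompression {
    // Compress a buffer with DEFLATE and return the compressed size in bytes.
    static int compressedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] out = new byte[input.length * 2 + 64];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(out);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        int size = 1024;
        byte[] zeros = new byte[size];      // what the tool sends today
        byte[] random = new byte[size];
        new Random(42).nextBytes(random);   // essentially incompressible

        System.out.println("zeros  -> " + compressedSize(zeros) + " bytes");
        System.out.println("random -> " + compressedSize(random) + " bytes");
    }
}
```

The zero payload shrinks to a few bytes while the random payload stays at (or slightly above) its original size, which is why a zeros-only benchmark inflates throughput for any compression codec.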


Re: ProducerPerformance.scala compressing Array of Zeros

Posted by Jiangjie Qin <jq...@linkedin.com.INVALID>.
I kind of think letting ProducerPerformance send uncompressed bytes is not
a bad idea. The reason is that when you send compressed bytes, it is not
easy to determine how much data you are actually sending. Arguably,
sending uncompressed bytes leaves the compression cost out of the
performance benchmark, but I think that is fine because the cost of
compression is typically negligible compared with network I/O, and it also
depends on what kind of raw data you are sending. So it does not make much
sense to me to add a compression step to ProducerPerformance.

But I absolutely agree that we should document this well so people
understand that:
1. ProducerPerformance should be used for uncompressed messages.
2. Compression is very content dependent, so it is not part of
ProducerPerformance.

If we really want to test compressed messages as well, maybe we can take a
file path as an argument and use that as the data source.
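A rough sketch of that file-backed data source idea. This is hypothetical: the helper name and the wrap-around behaviour for short files are my assumptions, not anything in Kafka:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FilePayloadSource {
    // Build a fixed-size benchmark payload from a file, so compression
    // runs against realistic content instead of synthetic zeros.
    // If the file is shorter than messageSize, its bytes repeat.
    static byte[] loadPayload(String path, int messageSize) throws IOException {
        byte[] raw = Files.readAllBytes(Paths.get(path));
        byte[] payload = new byte[messageSize];
        for (int i = 0; i < messageSize; i++) {
            payload[i] = raw[i % raw.length];
        }
        return payload;
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = loadPayload(args[0], 1024);
        System.out.println("payload bytes: " + payload.length);
    }
}
```

Feeding the producer a sample of real production messages this way would make the reported throughput reflect the compression ratio users would actually see.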

Thanks,

Jiangjie (Becket) Qin
