Posted to users@kafka.apache.org by Bert Corderman <be...@gmail.com> on 2014/06/26 21:42:14 UTC

Question on message content, compression, multiple messages per kafka message?

We are in the process of engineering a system that will use Kafka.
The legacy system uses the local file system and a database as the
queue. In terms of scale, we process about 35 billion events per day
contained in 15 million files.



I am looking for feedback on a design decision we are discussing



In our current system we depend heavily on compression as a performance
optimization. In Kafka the use of compression has some overhead, as the
broker needs to decompress the data to assign offsets and then re-compress
it. (This is explained in detail here:
http://geekmantra.wordpress.com/2013/03/28/compression-in-kafka-gzip-or-snappy/
)



We are thinking about NOT using Kafka compression, but rather compressing
multiple rows in our own code. For example, say we wanted to send data in
batches of 5,000 rows. Using Kafka compression, we would use a batch size of
5,000 rows and enable compression. The other option is using a batch size of
1 in Kafka, BUT in our code taking 5,000 messages, compressing them, and then
sending them to Kafka with the Kafka compression setting of none.



Is this a pattern others have used?



Regardless of compression, I am curious whether others are using a single
Kafka message to contain multiple messages from an application standpoint.


Bert

Re: Question on message content, compression, multiple messages per kafka message?

Posted by Bert Corderman <be...@gmail.com>.
What would you consider to be a message that is “too large”?



In April I ran a bunch of tests which I outlined in the following thread



http://grokbase.com/t/kafka/users/145g8k62rf/performance-testing-data-to-share



It includes a Google Doc link with all the results (it's easiest to download
it into Excel and use filters to drill into what you want). When looking at
Snappy vs. NONE I didn't see much improvement for the 2200-byte messages we
are looking at, and for small messages NONE was the fastest.



Running Kafka 0.8 on a three-node cluster: 16 cores, 256 GB RAM, 12 x 4 TB
drives

message.size = 2200

batch.size = 400

partitions = 12

replication = 3

acks = leader



I was able to get…

SNAPPY = 151K messages per second

NONE = 140K messages per second

GZIP = 86K messages per second



With small messages of 200 bytes

SNAPPY = 660K messages per second

NONE = 740K messages per second

GZIP = 340K messages per second





So let's assume I can compress 2200 bytes into 200 bytes. (I'm just using
these numbers because I ran tests at these sizes; my guess is I will not get
compression this good, but it's an example.) If I run uncompressed I could
process 140K messages per second. If I compressed in my application from
2200 to 200 bytes I could then send through Kafka at 740K events per second.
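Back-of-the-envelope math on those numbers (keeping in mind the 11x compression ratio above is hypothetical):

```python
# Uncompressed: 140K msgs/sec x 2200 bytes of application data each
uncompressed_mb_s = 140_000 * 2200 / 1e6   # 308 MB/s of application data

# App-side compression: 740K msgs/sec x 200 bytes on the wire, but each
# Kafka message still represents 2200 bytes of application data
compressed_mb_s = 740_000 * 2200 / 1e6     # 1628 MB/s of application data

speedup = 740_000 / 140_000                # roughly 5.3x more events/sec
print(round(uncompressed_mb_s), round(compressed_mb_s), round(speedup, 1))
```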




Bert









Re: Question on message content, compression, multiple messages per kafka message?

Posted by Chris Hogue <cs...@gmail.com>.
As a data point from one system: while Snappy compression is significantly
better than gzip, for our system it wasn't enough to offset the
decompress/compress on the broker. No matter how fast the compression is,
doing it on the broker will always be slower than not doing it.

We went the route the original poster is discussing and compressed in our
app on the producer, allowing the broker to do its in-place offset
management without decompressing. This has the trade-offs of having to do
the batching on our own and of our consuming applications having to manage
the decompression, but the result was about a 3x increase in throughput from
that change alone. Our application-level messages are around 3KB each.
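The consumer-side decompression Chris mentions can be a thin wrapper over the message stream. A sketch (hypothetical helpers, not Chris's actual code), assuming the producer gzips a batch of newline-delimited records into each Kafka message value:

```python
import gzip


def compress_records(records: list) -> bytes:
    """Producer side: join records (which must not contain newlines)
    with '\n' and gzip the batch into one Kafka message value."""
    return gzip.compress("\n".join(records).encode("utf-8"))


def iter_records(kafka_values):
    """Consumer side: decompress each Kafka message value and yield
    the individual application records packed inside it."""
    for value in kafka_values:
        for record in gzip.decompress(value).decode("utf-8").split("\n"):
            yield record


# Simulate two Kafka messages, each carrying a compressed batch
values = [compress_records(["a", "b"]), compress_records(["c"])]
assert list(iter_records(values)) == ["a", "b", "c"]
```

In a real consumer, `kafka_values` would be the raw message values fetched from the broker, which never had to decompress them.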

I think anyone wanting to get the most throughput may have good luck going
this route. The decompress/compress on the broker has a significant effect,
regardless of the compression scheme used.

-Chris




Re: Question on message content, compression, multiple messages per kafka message?

Posted by Neha Narkhede <ne...@gmail.com>.
Using a single Kafka message to contain an application snapshot has the
upside of getting atomicity for free. Either the snapshot will be written
as a whole to Kafka or not. This is poor man's transactionality. Care needs
to be taken to ensure that the message is not too large since that might
cause memory consumption problems on the server or the consumers.

As far as compression overhead is concerned, have you tried running Snappy?
Snappy's performance is good enough to offset the decompression-compression
overhead on the server.

Thanks,
Neha

