Posted to users@kafka.apache.org by Dan Hill <qu...@gmail.com> on 2022/03/14 20:56:03 UTC

Compression - producer vs topic?

Hi.  I looked around for advice about Kafka compression.  I've seen mixed
and conflicting advice.

Is there any sorta "if X, do Y" type of documentation around Kafka
compression?

Any advice?  Any good posts to read that talk about this trade-off?

*Detailed comments*
I tried looking for comparisons of producer vs. topic compression and didn't
find much.  Some of the information I found dates back to 2011 (which I'm
guessing is pretty stale).

I can guess at some potential benefits, but I don't know if they're real.
I've also seen sites claim certain trade-offs, but it's unclear whether
they're true.

It looks like I can modify an existing topic's compression.  I don't know if
that actually works.  I'd assume it would only affect data written going
forward.

I've seen multiple sites say that decompression happens in the broker and
multiple that say it happens in the consumer.

Re: Compression - producer vs topic?

Posted by Dan Hill <qu...@gmail.com>.
Thanks, Liam!  I'm convinced; I'll go with zstd.  I'm using an older version
of Flink that bundles an older Kafka producer (so zstd isn't available in it).
I'll switch to zstd when I upgrade.


Re: Compression - producer vs topic?

Posted by Liam Clarke-Hutchinson <lc...@redhat.com>.
Oh, and I meant to say: zstd is a good compromise between CPU and compression
ratio; IIRC it's far less costly on CPU than gzip.

So yeah, I generally recommend setting your topic's compression to
"producer", and then going from there.


Re: Compression - producer vs topic?

Posted by Liam Clarke-Hutchinson <lc...@redhat.com>.
Sounds like a goer then :) Those strings in the protobuf always get you;
you can't use clever encodings for them (like varints) the way you can with
numbers.


Re: Compression - producer vs topic?

Posted by Dan Hill <qu...@gmail.com>.
We're using protos, but there are still a bunch of custom fields where
clients specify redundant strings.

My local test shows a 75% reduction in size with either zstd or gzip.  I care
most about Kafka storage costs right now.


Re: Compression - producer vs topic?

Posted by Liam Clarke-Hutchinson <lc...@redhat.com>.
Hi Dan,

Okay, so if you're looking for low latency, I'm guessing that you're using
a very low linger.ms in the producers? Also, what format are the records?
If they're already in a binary format like Protobuf or Avro, unless they're
composed largely of strings, compression may offer little benefit.

With your small records, I'd suggest running some tests with your current
config and different compression settings - none, snappy, lz4 (don't bother
with gzip unless that's all you have) - and checking the producer metrics
(available via JMX if you're using the Java clients) for batch-size-avg and
compression-rate-avg.

You may just wish to start with no compression, and then consider moving to
it if/when network bandwidth becomes a bottleneck.
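
If it helps, here's a rough sketch of that test using the Java client - the
broker address, topic name, and payload are placeholders, and you'd re-run it
once per codec:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import java.nio.charset.StandardCharsets;
    import java.util.Properties;

    public class CompressionTest {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
            props.put("compression.type", "lz4"); // re-run with none/snappy/zstd

            try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
                byte[] value = "sample payload".getBytes(StandardCharsets.UTF_8);
                for (int i = 0; i < 10_000; i++) {
                    producer.send(new ProducerRecord<>("test-topic", value));
                }
                producer.flush();
                // compression-rate-avg is compressed size / uncompressed size,
                // so lower means better compression.
                producer.metrics().forEach((name, metric) -> {
                    if (name.name().equals("batch-size-avg")
                            || name.name().equals("compression-rate-avg")) {
                        System.out.println(name.name() + " = " + metric.metricValue());
                    }
                });
            }
        }
    }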

Regards,

Liam


Re: Compression - producer vs topic?

Posted by Dan Hill <qu...@gmail.com>.
Thanks, Liam!

I have a mixture of Kafka record sizes.  10% are large (>100 KB) and 90% are
smaller than 1 KB.  I'm working on a streaming analytics solution that
streams impressions, user actions, and serving info and combines them.
End-to-end latency is more important than storage size.



Re: Compression - producer vs topic?

Posted by Liam Clarke-Hutchinson <lc...@redhat.com>.
Hi Dan,

Decompression generally only happens on the broker if the topic has a
particular compression algorithm set and the producer is using a different
one; in that case the broker decompresses the records from the producer, then
recompresses them using the topic's configured algorithm. (The LogCleaner
will also decompress and recompress records when compacting compressed
topics.)
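
You can check what a topic is currently set to with the config tool (broker
address and topic name are placeholders):

    bin/kafka-configs.sh --bootstrap-server localhost:9092 \
      --entity-type topics --entity-name events --describe

If the topic's compression.type matches the producer's codec, or is set to
"producer", the broker skips that decompress/recompress step.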

The consumer decompresses compressed record batches it receives.

In my opinion, using topic compression instead of producer compression only
makes sense if the producing app can't tolerate the extra CPU cycles that
compression costs. In all of my use cases, network throughput becomes a
bottleneck long before producer compression CPU cost does.

For your "if X, do Y" formulation I'd say - if your producer is sending
tiny batches, do some analysis of compressed vs. uncompressed size for your
given compression algorithm - you may find that compression overhead
increases batch size for tiny batches.
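
A crude way to eyeball that, using only the JDK's gzip (zstd or snappy would
need extra libraries) - the sample record here is made up:

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPOutputStream;

    public class TinyBatchCheck {
        public static void main(String[] args) throws Exception {
            byte[] record = "user=123,action=click,surface=home"
                    .getBytes(StandardCharsets.UTF_8);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(record);
            }
            // For a record this small, gzip's ~18 bytes of header/trailer
            // usually make the output larger than the input.
            System.out.printf("raw=%d bytes, gzipped=%d bytes%n",
                    record.length, out.size());
        }
    }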

If you're sending a large amount of data, do tune your batching and use
compression to reduce data being sent over the wire.

If you can tell us more about your problem domain, there might be more
advice that's applicable :)

Cheers,

Liam Clarke-Hutchinson
