You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kafka.apache.org by Dongjin Lee <do...@apache.org> on 2018/05/30 11:31:08 UTC

Re: [DISCUSS] KIP-110: Add Codec for ZStandard Compression

Hello. I updated the KIP page for this issue, following Ismael's guide.

https://cwiki.apache.org/confluence/display/KAFKA/KIP-110%3A+Add+Codec+for+
ZStandard+Compression

The main difference between the previous version is:

- Motivation: Added real-world benchmark. (Thanks to Ivan)
- Accompanying Issues: especially the problems related to the old clients &
brokers.

Please have a look when you are free.

Thanks,
Dongjin

On Sat, Feb 4, 2017 at 11:51 AM, Jeff Widman <je...@netskope.com> wrote:

> *My concern is that this minor improvement (based on the benchmark) over
> LZ4 does not warrant the work of adding support for a new compression codec
> to the broker, official clients and horde of 3rd party clients, including
> upgrade paths, transformations, tests, additional dependencies, etc.*
>
>
> As someone who primarily works with non-Java clients, I couldn't +1 this
> enough.
>
> If there's truly a benefit (and probably there is...the Facebook post
> announcing ZStandard had some fairly compelling benchmarks) then yes, let's
> do this. But otherwise, please no.
>
> On Tue, Jan 31, 2017 at 1:47 AM, Magnus Edenhill <ma...@edenhill.se>
> wrote:
>
> > Hi Dongjin and good work on the KIP,
> >
> > I understand that ZStandard is generally considered an improvement over
> > LZ4, but the
> > benchmark you provided on the KIP-110 wiki doesn't really reflect that,
> and
> > even
> > makes a note that they are comparable:
> > *> As you can see above, ZStandard shows outstanding performance in both
> of
> > compression rate and speed, especially working with the speed-first
> setting
> > (level 1). To the extent that only LZ4 can be compared to ZStandard.*
> >
> > My concern is that this minor improvement (based on the benchmark) over
> LZ4
> > does not warrant the work
> > of adding support for a new compression codec to the broker, official
> > clients and horde of 3rd party clients, including
> > upgrade paths, transformations, tests, additional dependencies, etc.
> >
> > Is it possible to produce more convincing comparisons?
> >
> > Thanks,
> > Magnus
> >
> >
> >
> >
> >
> > 2017-01-31 10:28 GMT+01:00 Dongjin Lee <do...@apache.org>:
> >
> > > Ismael & All,
> > >
> > > After Inspecting the related code & commits, I concluded following:
> > >
> > > 1. We have to update the masking value which is used to retrieve the
> used
> > > codec id from the messages, to enable the retrieval of the 3rd bit of
> > > compression type field of the message.
> > > 2. The above task is already done; so, we need nothing.
> > >
> > > Here is why.
> > >
> > > Let's start with the first one, with the scenario Juma proposed. In the
> > > case of receiving the message of unsupported compression type, the
> > receiver
> > > (= broker or consumer) raises IllegalArgumentException[^1][^2]. The
> key
> > > element in this operation is Record#COMPRESSION_CODEC_MASK, which is
> used
> > > to extract the codec id. We have to update this value from 2-bits
> > extractor
> > > (=0x03) to 3-bits extractor (=0x07).
> > >
> > > But in fact, this task is already done, so its current value is 0x07.
> We
> > > don't have to update it.
> > >
> > > The reason why this task is already done has some story; From the first
> > > time Record.java file was added to the project[^3], the
> > > COMPRESSION_CODEC_MASK was already 2-bits extractor, that is, 0x03. At
> > that
> > > time, Kafka supported only two compression types - GZipCompression and
> > > SnappyCompression.[^4] After that, KAFKA-1456 introduced two additional
> > > codecs of LZ4 and LZ4C[^5]. This update modified COMPRESSION_CODEC_MASK
> > > into 3 bits extractor, 0x07, in the aim of supporting four compression
> > > codecs.
> > >
> > > Although its following work, KAFKA-1493, removed the support of LZ4C
> > > codec[^6], this mask was not reverted into 2-bits extractor - by this
> > > reason, we don't need to care about the message format.
> > >
> > > Attached screenshot is my test on Juma's scenario. I created topic &
> sent
> > > some messages using the snapshot version with ZStandard compression and
> > > received the message with the latest version. As you can see, it works
> > > perfectly as expected.
> > >
> > > If you have more opinion to this issue, don't hesitate to send me as a
> > > message.
> > >
> > > Best,
> > > Dongjin
> > >
> > > [^1]: see: Record#compressionType.
> > > [^2]: There is similar routine in Message.scala. But after KAFKA-4390,
> > > that routine is not being used anymore - more precisely, Message class
> is
> > > now used in ConsoleConsumer only. I think this class should be replaced
> > but
> > > since it is a separated topic, I will send another message for this
> > issue.
> > > [^3]: commit 642da2f (2011.8.2).
> > > [^4]: commit c51b940.
> > > [^5]: commit 547cced.
> > > [^6]: commit 37356bf.
> > >
> > > On Thu, Jan 26, 2017 at 12:35 AM, Ismael Juma <is...@juma.me.uk>
> wrote:
> > >
> > >> So far the discussion was around the performance characteristics of
> the
> > >> new
> > >> compression algorithm. Another area that is important and is not
> covered
> > >> in
> > >> the KIP is the compatibility implications. For example, what happens
> if
> > a
> > >> consumer that doesn't support zstd tries to consume a topic compressed
> > >> with
> > >> it? Or if a broker that doesn't support receives data compressed with
> > it?
> > >> If we go through that exercise, then more changes may be required
> (like
> > >> bumping the version of produce/fetch protocols).
> > >>
> > >> Ismael
> > >>
> > >> On Wed, Jan 25, 2017 at 3:22 PM, Ben Stopford <be...@confluent.io>
> wrote:
> > >>
> > >> > Is there more discussion to be had on this KIP, or should it be
> taken
> > >> to a
> > >> > vote?
> > >> >
> > >> > On Mon, Jan 16, 2017 at 6:37 AM Dongjin Lee <do...@apache.org>
> > wrote:
> > >> >
> > >> > > I updated KIP-110 with JMH-measured benchmark results. Please
> have a
> > >> > review
> > >> > > when you are free. (The overall result is not different yet.)
> > >> > >
> > >> > > Regards,
> > >> > > Dongjin
> > >> > >
> > >> > > +1. Could anyone assign KAFKA-4514 to me?
> > >> > >
> > >> > > On Thu, Jan 12, 2017 at 11:39 AM, Dongjin Lee <dongjin@apache.org
> >
> > >> > wrote:
> > >> > >
> > >> > > > Okay, I will have a try.
> > >> > > > Thanks Ewen for the guidance!!
> > >> > > >
> > >> > > > Best,
> > >> > > > Dongjin
> > >> > > >
> > >> > > > On Thu, Jan 12, 2017 at 6:44 AM, Ismael Juma <ismael@juma.me.uk
> >
> > >> > wrote:
> > >> > > >
> > >> > > >> That's a good point Ewen. Dongjin, you could use the branch
> that
> > >> Ewen
> > >> > > >> linked for the performance testing. It would also help validate
> > the
> > >> > PR.
> > >> > > >>
> > >> > > >> Ismael
> > >> > > >>
> > >> > > >> On Wed, Jan 11, 2017 at 9:38 PM, Ewen Cheslack-Postava <
> > >> > > ewen@confluent.io
> > >> > > >> >
> > >> > > >> wrote:
> > >> > > >>
> > >> > > >> > FYI, there's an outstanding patch for getting some JMH
> > >> benchmarking
> > >> > > >> setup:
> > >> > > >> > https://github.com/apache/kafka/pull/1712 I haven't found
> time
> > >> to
> > >> > > >> review
> > >> > > >> > it
> > >> > > >> > (and don't really know JMH well anyway) but it might be worth
> > >> > getting
> > >> > > >> that
> > >> > > >> > landed so we can use it for this as well.
> > >> > > >> >
> > >> > > >> > -Ewen
> > >> > > >> >
> > >> > > >> > On Wed, Jan 11, 2017 at 6:35 AM, Dongjin Lee <
> > dongjin@apache.org
> > >> >
> > >> > > >> wrote:
> > >> > > >> >
> > >> > > >> > > Hi Ismael,
> > >> > > >> > >
> > >> > > >> > > 1. In the case of compression output, yes, lz4 is producing
> > the
> > >> > > >> smaller
> > >> > > >> > > output than gzip. In fact, my benchmark was inspired
> > >> > > >> > > by MessageCompressionTest#testCompressSize unit test and
> the
> > >> > result
> > >> > > >> is
> > >> > > >> > > same - 396 bytes for gzip and 387 bytes for lz4.
> > >> > > >> > > 2. I agree that my (former) approach can result in
> unreliable
> > >> > > output.
> > >> > > >> > > However, I am experiencing difficulties on how to acquire
> the
> > >> > > >> benchmark
> > >> > > >> > > metrics from Kafka. For you recommended JMH, I just started
> > to
> > >> > > google
> > >> > > >> for
> > >> > > >> > > it. If possible, could you give any example on how to use
> JMH
> > >> > > against
> > >> > > >> > > Kafka? If it is the case, it will be a great help.
> > >> > > >> > > Regards,Dongjin
> > >> > > >> > >
> > >> > > >> > >                 _____________________________
> > >> > > >> > > From: Ismael Juma <is...@juma.me.uk>
> > >> > > >> > > Sent: Wednesday, January 11, 2017 7:33 PM
> > >> > > >> > > Subject: Re: [DISCUSS] KIP-110: Add Codec for ZStandard
> > >> > Compression
> > >> > > >> > > To:  <de...@kafka.apache.org>
> > >> > > >> > >
> > >> > > >> > >
> > >> > > >> > > Thanks Dongjin. I highly recommend using JMH for the
> > benchmark,
> > >> > the
> > >> > > >> > > existing one has a few problems that could result in
> > unreliable
> > >> > > >> results.
> > >> > > >> > > Also, it's a bit surprising that LZ4 is producing smaller
> > >> output
> > >> > > than
> > >> > > >> > gzip.
> > >> > > >> > > Is that right?
> > >> > > >> > >
> > >> > > >> > > Ismael
> > >> > > >> > >
> > >> > > >> > > On Wed, Jan 11, 2017 at 10:20 AM, Dongjin Lee <
> > >> dongjin@apache.org
> > >> > >
> > >> > > >> > wrote:
> > >> > > >> > >
> > >> > > >> > > > Ismael,
> > >> > > >> > > >
> > >> > > >> > > > I pushed the benchmark code I used, with some updates
> > >> > (iteration:
> > >> > > >> 20 ->
> > >> > > >> > > > 1000). I also updated the KIP page with the updated
> > benchmark
> > >> > > >> results.
> > >> > > >> > > > Please take a review when you are free. The attached
> > >> screenshot
> > >> > > >> shows
> > >> > > >> > how
> > >> > > >> > > > to run the benchmarker.
> > >> > > >> > > >
> > >> > > >> > > > Thanks,
> > >> > > >> > > > Dongjin
> > >> > > >> > > >
> > >> > > >> > > > On Tue, Jan 10, 2017 at 8:03 PM, Dongjin Lee <
> > >> > dongjin@apache.org>
> > >> > > >> > wrote:
> > >> > > >> > > >
> > >> > > >> > > >> Ismael,
> > >> > > >> > > >>
> > >> > > >> > > >> I see. Then, I will share the benchmark code I used by
> > >> > tomorrow.
> > >> > > >> > Thanks
> > >> > > >> > > >> for your guidance.
> > >> > > >> > > >>
> > >> > > >> > > >> Best,
> > >> > > >> > > >> Dongjin
> > >> > > >> > > >>
> > >> > > >> > > >> -----
> > >> > > >> > > >>
> > >> > > >> > > >> Dongjin Lee
> > >> > > >> > > >>
> > >> > > >> > > >> Software developer in Line+.
> > >> > > >> > > >> So interested in massive-scale machine learning.
> > >> > > >> > > >>
> > >> > > >> > > >> facebook: www.facebook.com/dongjin.lee.kr
> > >> > > >> > > >> linkedin: kr.linkedin.com/in/dongjinleekr
> > >> > > >> > > >> github: github.com/dongjinleekr
> > >> > > >> > > >> twitter: www.twitter.com/dongjinleekr
> > >> > > >> > > >>
> > >> > > >> > > >>
> > >> > > >> > > >>
> > >> > > >> > > >>
> > >> > > >> > > >> On Tue, Jan 10, 2017 at 7:24 PM +0900, "Ismael Juma" <
> > >> > > >> > ismael@juma.me.uk
> > >> > > >> > > >
> > >> > > >> > > >> wrote:
> > >> > > >> > > >>
> > >> > > >> > > >> Dongjin,
> > >> > > >> > > >>>
> > >> > > >> > > >>> The KIP states:
> > >> > > >> > > >>>
> > >> > > >> > > >>> "I compared the compressed size and compression time
> of 3
> > >> > > >> 1kb-sized
> > >> > > >> > > >>> messages (3102 bytes in total), with the
> > >> Draft-implementation
> > >> > of
> > >> > > >> > > ZStandard
> > >> > > >> > > >>> Compression Codec and all currently available
> > >> > CompressionCodecs.
> > >> > > >> All
> > >> > > >> > > >>> elapsed times are the average of 20 trials."
> > >> > > >> > > >>>
> > >> > > >> > > >>> But doesn't give any details of how this was
> implemented.
> > >> Is
> > >> > the
> > >> > > >> > source
> > >> > > >> > > >>> code available somewhere? Micro-benchmarking in the JVM
> > is
> > >> > > pretty
> > >> > > >> > > tricky so
> > >> > > >> > > >>> it needs verification before numbers can be trusted. A
> > >> > > performance
> > >> > > >> > test
> > >> > > >> > > >>> with kafka-producer-perf-test.sh would be nice to have
> as
> > >> > well,
> > >> > > if
> > >> > > >> > > possible.
> > >> > > >> > > >>>
> > >> > > >> > > >>> Thanks,
> > >> > > >> > > >>> Ismael
> > >> > > >> > > >>>
> > >> > > >> > > >>> On Tue, Jan 10, 2017 at 7:44 AM, Dongjin Lee  wrote:
> > >> > > >> > > >>>
> > >> > > >> > > >>> > Ismael,
> > >> > > >> > > >>> >
> > >> > > >> > > >>> > 1. Is the benchmark in the KIP page not enough? You
> > mean
> > >> we
> > >> > > >> need a
> > >> > > >> > > whole
> > >> > > >> > > >>> > performance test using kafka-producer-perf-test.sh?
> > >> > > >> > > >>> >
> > >> > > >> > > >>> > 2. It seems like no major project is relying on it
> > >> > currently.
> > >> > > >> > > However,
> > >> > > >> > > >>> > after reviewing the code, I concluded that at least
> > this
> > >> > > project
> > >> > > >> > has
> > >> > > >> > > a good
> > >> > > >> > > >>> > test coverage. And for the problem of upstream
> > tracking -
> > >> > > >> although
> > >> > > >> > > there is
> > >> > > >> > > >>> > no significant update on ZStandard to judge this
> > >> problem, it
> > >> > > >> seems
> > >> > > >> > > not bad.
> > >> > > >> > > >>> > If required, I can take responsibility of the
> tracking
> > >> for
> > >> > > this
> > >> > > >> > > library.
> > >> > > >> > > >>> >
> > >> > > >> > > >>> > Thanks,
> > >> > > >> > > >>> > Dongjin
> > >> > > >> > > >>> >
> > >> > > >> > > >>> > On Tue, Jan 10, 2017 at 7:09 AM, Ismael Juma  wrote:
> > >> > > >> > > >>> >
> > >> > > >> > > >>> > > Thanks for posting the KIP, ZStandard looks like a
> > nice
> > >> > > >> > > improvement over
> > >> > > >> > > >>> > > the existing compression algorithms. A couple of
> > >> > questions:
> > >> > > >> > > >>> > >
> > >> > > >> > > >>> > > 1. Can you please elaborate on the details of the
> > >> > benchmark?
> > >> > > >> > > >>> > > 2. About https://github.com/luben/zstd-jni, can we
> > >> rely
> > >> > on
> > >> > > >> it? A
> > >> > > >> > > few
> > >> > > >> > > >>> > > things
> > >> > > >> > > >>> > > to consider: are there other projects using it,
> does
> > it
> > >> > have
> > >> > > >> good
> > >> > > >> > > test
> > >> > > >> > > >>> > > coverage, are there performance tests, does it
> track
> > >> > > upstream
> > >> > > >> > > closely?
> > >> > > >> > > >>> > >
> > >> > > >> > > >>> > > Thanks,
> > >> > > >> > > >>> > > Ismael
> > >> > > >> > > >>> > >
> > >> > > >> > > >>> > > On Fri, Jan 6, 2017 at 2:40 AM, Dongjin Lee  wrote:
> > >> > > >> > > >>> > >
> > >> > > >> > > >>> > > > Hi all,
> > >> > > >> > > >>> > > >
> > >> > > >> > > >>> > > > I've just posted a new KIP "KIP-110: Add Codec
> for
> > >> > > ZStandard
> > >> > > >> > > >>> > Compression"
> > >> > > >> > > >>> > > > for
> > >> > > >> > > >>> > > > discussion:
> > >> > > >> > > >>> > > >
> > >> > > >> > > >>> > > > https://cwiki.apache.org/confl
> > >> uence/display/KAFKA/KIP-
> > >> > > >> > > >>> > > > 110%3A+Add+Codec+for+ZStandard+Compression
> > >> > > >> > > >>> > > >
> > >> > > >> > > >>> > > > Please have a look when you are free.
> > >> > > >> > > >>> > > >
> > >> > > >> > > >>> > > > Best,
> > >> > > >> > > >>> > > > Dongjin
> > >> > > >> > > >>> > > >
> > >> > > >> > > >>> > > > --
> > >> > > >> > > >>> > > > *Dongjin Lee*
> > >> > > >> > > >>> > > >
> > >> > > >> > > >>> > > >
> > >> > > >> > > >>> > > > *Software developer in Line+.So interested in
> > >> > > massive-scale
> > >> > > >> > > machine
> > >> > > >> > > >>> > > > learning.facebook: www.facebook.com/dongjin.lee.
> kr
> > >> > > >> > > >>> > > > linkedin:
> > >> > > >> > > >>> > > > kr.linkedin.com/in/dongjinleekr
> > >> > > >> > > >>> > > > github:
> > >> > > >> > > >>> > > > github.com/dongjinleekr
> > >> > > >> > > >>> > > > twitter: www.twitter.com/dongjinleekr
> > >> > > >> > > >>> > > > *
> > >> > > >> > > >>> > > >
> > >> > > >> > > >>> > >
> > >> > > >> > > >>> >
> > >> > > >> > > >>> >
> > >> > > >> > > >>> >
> > >> > > >> > > >>> > --
> > >> > > >> > > >>> > *Dongjin Lee*
> > >> > > >> > > >>> >
> > >> > > >> > > >>> >
> > >> > > >> > > >>> > *Software developer in Line+.So interested in
> > >> massive-scale
> > >> > > >> machine
> > >> > > >> > > >>> > learning.facebook: www.facebook.com/dongjin.lee.kr
> > >> > > >> > > >>> > linkedin:
> > >> > > >> > > >>> > kr.linkedin.com/in/dongjinleekr
> > >> > > >> > > >>> > github:
> > >> > > >> > > >>> > github.com/dongjinleekr
> > >> > > >> > > >>> > twitter: www.twitter.com/dongjinleekr
> > >> > > >> > > >>> > *
> > >> > > >> > > >>> >
> > >> > > >> > > >>>
> > >> > > >> > > >>>
> > >> > > >> > > >
> > >> > > >> > > >
> > >> > > >> > > > --
> > >> > > >> > > > *Dongjin Lee*
> > >> > > >> > > >
> > >> > > >> > > >
> > >> > > >> > > > *Software developer in Line+.So interested in
> massive-scale
> > >> > > machine
> > >> > > >> > > > learning.facebook: www.facebook.com/dongjin.lee.kr
> > >> > > >> > > > <http://www.facebook.com/dongjin.lee.kr>linkedin:
> > >> > > >> kr.linkedin.com/in/
> > >> > > >> > > dongjinleekr
> > >> > > >> > > > <http://kr.linkedin.com/in/dongjinleekr>github:
> > >> > > >> > > > <http://goog_969573159/>github.com/dongjinleekr
> > >> > > >> > > > <http://github.com/dongjinleekr>twitter:
> > >> > > >> www.twitter.com/dongjinleekr
> > >> > > >> > > > <http://www.twitter.com/dongjinleekr>*
> > >> > > >> > > >
> > >> > > >> > >
> > >> > > >> > >
> > >> > > >> > >
> > >> > > >> > >
> > >> > > >> > >
> > >> > > >> >
> > >> > > >>
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > --
> > >> > > > *Dongjin Lee*
> > >> > > >
> > >> > > >
> > >> > > > *Software developer in Line+.So interested in massive-scale
> > machine
> > >> > > > learning.facebook: www.facebook.com/dongjin.lee.kr
> > >> > > > <http://www.facebook.com/dongjin.lee.kr>linkedin:
> > >> > > kr.linkedin.com/in/dongjinleekr
> > >> > > > <http://kr.linkedin.com/in/dongjinleekr>github:
> > >> > > > <http://goog_969573159/>github.com/dongjinleekr
> > >> > > > <http://github.com/dongjinleekr>twitter:
> > >> www.twitter.com/dongjinleekr
> > >> > > > <http://www.twitter.com/dongjinleekr>*
> > >> > > >
> > >> > >
> > >> > >
> > >> > >
> > >> > > --
> > >> > > *Dongjin Lee*
> > >> > >
> > >> > >
> > >> > > *Software developer in Line+.So interested in massive-scale
> machine
> > >> > > learning.facebook: www.facebook.com/dongjin.lee.kr
> > >> > > <http://www.facebook.com/dongjin.lee.kr>linkedin:
> > >> > > kr.linkedin.com/in/dongjinleekr
> > >> > > <http://kr.linkedin.com/in/dongjinleekr>github:
> > >> > > <http://goog_969573159/>github.com/dongjinleekr
> > >> > > <http://github.com/dongjinleekr>twitter:
> > www.twitter.com/dongjinleekr
> > >> > > <http://www.twitter.com/dongjinleekr>*
> > >> > >
> > >> >
> > >>
> > >
> > >
> > >
> > > --
> > > *Dongjin Lee*
> > >
> > >
> > > *Software developer in Line+.So interested in massive-scale machine
> > > learning.facebook: www.facebook.com/dongjin.lee.kr
> > > <http://www.facebook.com/dongjin.lee.kr>linkedin: kr.linkedin.com/in/
> > dongjinleekr
> > > <http://kr.linkedin.com/in/dongjinleekr>github:
> > > <http://goog_969573159/>github.com/dongjinleekr
> > > <http://github.com/dongjinleekr>twitter: www.twitter.com/dongjinleekr
> > > <http://www.twitter.com/dongjinleekr>*
> > >
> >
>



-- 
*Dongjin Lee*

*A hitchhiker in the mathematical world.*

*github:  <http://goog_969573159/>github.com/dongjinleekr
<http://github.com/dongjinleekr>linkedin: kr.linkedin.com/in/dongjinleekr
<http://kr.linkedin.com/in/dongjinleekr>slideshare:
www.slideshare.net/dongjinleekr
<http://www.slideshare.net/dongjinleekr>*