Posted to dev@kafka.apache.org by Ewen Cheslack-Postava <ew...@confluent.io> on 2017/02/03 06:40:18 UTC

Re: [DISCUSS] KIP-110: Add Codec for ZStandard Compression

I think attachments get filtered by the Apache mailing lists. You might
need to post it elsewhere. I think you could upload it to the corresponding
JIRA, for example.

-Ewen

On Tue, Jan 31, 2017 at 3:06 AM, Dongjin Lee <do...@apache.org> wrote:

> No problem. I will update my KIP with the new benchmark and the error
> cases.
>
> And here, I am resending the screenshot:
>
>
> Best,
> Dongjin
>
> On Tue, Jan 31, 2017 at 7:42 PM, Ismael Juma <is...@juma.me.uk> wrote:
>
>> Thanks Dongjin. That seems to make things simpler indeed. It would be good
>> to update the KIP with the relevant details. I suggest describing each flow
>> and the error that the user would see in each case.
>>
>> Also, you mentioned an attached screenshot, but it seems like there was
>> nothing attached to the email.
>>
>> Ismael
>>
>> On Tue, Jan 31, 2017 at 9:28 AM, Dongjin Lee <do...@apache.org> wrote:
>>
>> > Ismael & All,
>> >
>> > After inspecting the related code & commits, I concluded the following:
>> >
>> > 1. We have to update the masking value that is used to retrieve the codec
>> > id from a message, so that the 3rd bit of the message's compression type
>> > field can be read.
>> > 2. This is already done, so nothing needs to change.
>> >
>> > Here is why.
>> >
>> > Let's start with the first one, with the scenario Juma proposed. On
>> > receiving a message with an unsupported compression type, the receiver
>> > (= broker or consumer) raises IllegalArgumentException[^1][^2]. The key
>> > element in this operation is Record#COMPRESSION_CODEC_MASK, which is used
>> > to extract the codec id. We would have to update this value from a 2-bit
>> > extractor (=0x03) to a 3-bit extractor (=0x07).
>> >
>> > But in fact, this has already been done, so its current value is 0x07. We
>> > don't have to update it.
>> >
>> > There is some history behind why this is already done. When Record.java
>> > was first added to the project[^3], COMPRESSION_CODEC_MASK was a 2-bit
>> > extractor, i.e., 0x03; at that time, Kafka supported only two compression
>> > types, GZipCompression and SnappyCompression.[^4] Later, KAFKA-1456
>> > introduced two additional codecs, LZ4 and LZ4C[^5]. That update changed
>> > COMPRESSION_CODEC_MASK to a 3-bit extractor, 0x07, in order to support
>> > four compression codecs.
>> >
>> > Although the follow-up work, KAFKA-1493, removed support for the LZ4C
>> > codec[^6], the mask was never reverted to a 2-bit extractor - for this
>> > reason, we don't need to worry about the message format.
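>> >
>> > To make this concrete, the extraction boils down to something like the
>> > following (a simplified sketch of the idea only, not the actual
>> > Record.java code; the class and method names are mine, and I assume
>> > ZStandard takes the next free codec id):
>> >
>> > // Sketch: the low 3 bits of a message's attributes byte select the codec.
>> > class CodecIdSketch {
>> >     // Already a 3-bit extractor on trunk, so a new codec id fits as-is.
>> >     private static final int COMPRESSION_CODEC_MASK = 0x07;
>> >
>> >     static int codecId(byte attributes) {
>> >         return attributes & COMPRESSION_CODEC_MASK;
>> >     }
>> >
>> >     static String codecName(byte attributes) {
>> >         switch (codecId(attributes)) {
>> >             case 0: return "none";
>> >             case 1: return "gzip";
>> >             case 2: return "snappy";
>> >             case 3: return "lz4";
>> >             case 4: return "zstd"; // assumed id for ZStandard
>> >             default:
>> >                 // The path an older broker/consumer hits for an unknown id.
>> >                 throw new IllegalArgumentException(
>> >                         "Unknown compression codec id: " + codecId(attributes));
>> >         }
>> >     }
>> > }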
>> >
>> > The attached screenshot is my test of Juma's scenario. I created a topic &
>> > sent some messages using the snapshot version with ZStandard compression,
>> > then consumed them with the latest released version. As you can see, it
>> > works exactly as expected.
>> >
>> > If you have any further opinions on this issue, don't hesitate to send me
>> > a message.
>> >
>> > Best,
>> > Dongjin
>> >
>> > [^1]: see: Record#compressionType.
>> > [^2]: There is a similar routine in Message.scala, but after KAFKA-4390
>> > that routine is no longer used - more precisely, the Message class is now
>> > used only in ConsoleConsumer. I think this class should be replaced, but
>> > since that is a separate topic, I will send another message about it.
>> > [^3]: commit 642da2f (2011.8.2).
>> > [^4]: commit c51b940.
>> > [^5]: commit 547cced.
>> > [^6]: commit 37356bf.
>> >
>> > On Thu, Jan 26, 2017 at 12:35 AM, Ismael Juma <is...@juma.me.uk>
>> wrote:
>> >
>> >> So far the discussion has been around the performance characteristics of
>> >> the new compression algorithm. Another area that is important and is not
>> >> covered in the KIP is the compatibility implications. For example, what
>> >> happens if a consumer that doesn't support zstd tries to consume a topic
>> >> compressed with it? Or if a broker that doesn't support it receives data
>> >> compressed with it? If we go through that exercise, more changes may be
>> >> required (like bumping the versions of the produce/fetch protocols).
>> >>
>> >> Ismael
>> >>
>> >> On Wed, Jan 25, 2017 at 3:22 PM, Ben Stopford <be...@confluent.io>
>> wrote:
>> >>
>> >> > Is there more discussion to be had on this KIP, or should it be taken to
>> >> > a vote?
>> >> >
>> >> > On Mon, Jan 16, 2017 at 6:37 AM Dongjin Lee <do...@apache.org>
>> wrote:
>> >> >
>> >> > > I updated KIP-110 with JMH-measured benchmark results. Please take a
>> >> > > look when you are free. (The overall results are not much different
>> >> > > from before.)
>> >> > >
>> >> > > Regards,
>> >> > > Dongjin
>> >> > >
>> >> > > +1. Could anyone assign KAFKA-4514 to me?
>> >> > >
>> >> > > On Thu, Jan 12, 2017 at 11:39 AM, Dongjin Lee <do...@apache.org>
>> >> > wrote:
>> >> > >
>> >> > > > Okay, I will give it a try.
>> >> > > > Thanks Ewen for the guidance!!
>> >> > > >
>> >> > > > Best,
>> >> > > > Dongjin
>> >> > > >
>> >> > > > On Thu, Jan 12, 2017 at 6:44 AM, Ismael Juma <is...@juma.me.uk>
>> >> > wrote:
>> >> > > >
>> >> > > >> That's a good point, Ewen. Dongjin, you could use the branch that
>> >> > > >> Ewen linked for the performance testing. It would also help validate
>> >> > > >> the PR.
>> >> > > >>
>> >> > > >> Ismael
>> >> > > >>
>> >> > > >> On Wed, Jan 11, 2017 at 9:38 PM, Ewen Cheslack-Postava <ewen@confluent.io>
>> >> > > >> wrote:
>> >> > > >>
>> >> > > >> > FYI, there's an outstanding patch for getting some JMH benchmarking
>> >> > > >> > set up: https://github.com/apache/kafka/pull/1712 - I haven't found
>> >> > > >> > time to review it (and don't really know JMH well anyway) but it
>> >> > > >> > might be worth getting that landed so we can use it for this as
>> >> > > >> > well.
>> >> > > >> >
>> >> > > >> > -Ewen
>> >> > > >> >
>> >> > > >> > On Wed, Jan 11, 2017 at 6:35 AM, Dongjin Lee <dongjin@apache.org>
>> >> > > >> > wrote:
>> >> > > >> >
>> >> > > >> > > Hi Ismael,
>> >> > > >> > >
>> >> > > >> > > 1. Regarding the compressed output, yes, lz4 produces smaller
>> >> > > >> > > output than gzip here. In fact, my benchmark was inspired by the
>> >> > > >> > > MessageCompressionTest#testCompressSize unit test, and the result
>> >> > > >> > > is the same - 396 bytes for gzip and 387 bytes for lz4.
>> >> > > >> > > 2. I agree that my (former) approach can produce unreliable
>> >> > > >> > > results. However, I am having difficulty working out how to gather
>> >> > > >> > > benchmark metrics from Kafka. As for the JMH you recommended, I
>> >> > > >> > > have just started reading up on it. If possible, could you give an
>> >> > > >> > > example of how to use JMH against Kafka? That would be a great
>> >> > > >> > > help; a rough first attempt of mine is below.
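>> >> > > >> > >
>> >> > > >> > > (The sketch below is just my own draft, not code from the Kafka
>> >> > > >> > > tree; the class name, payload, and JMH settings are placeholders,
>> >> > > >> > > and it compresses via java.util.zip and zstd-jni directly rather
>> >> > > >> > > than through Kafka's compressors.)
>> >> > > >> > >
>> >> > > >> > > import java.io.ByteArrayOutputStream;
>> >> > > >> > > import java.util.concurrent.TimeUnit;
>> >> > > >> > > import java.util.zip.GZIPOutputStream;
>> >> > > >> > > import com.github.luben.zstd.Zstd;
>> >> > > >> > > import org.openjdk.jmh.annotations.*;
>> >> > > >> > >
>> >> > > >> > > @State(Scope.Thread)
>> >> > > >> > > @BenchmarkMode(Mode.AverageTime)
>> >> > > >> > > @OutputTimeUnit(TimeUnit.MICROSECONDS)
>> >> > > >> > > @Fork(1)
>> >> > > >> > > @Warmup(iterations = 5)
>> >> > > >> > > @Measurement(iterations = 10)
>> >> > > >> > > public class CompressionBenchmark {
>> >> > > >> > >
>> >> > > >> > >     private byte[] payload; // roughly the 3 x 1kb messages from the KIP
>> >> > > >> > >
>> >> > > >> > >     @Setup
>> >> > > >> > >     public void setup() {
>> >> > > >> > >         payload = new byte[3 * 1024];
>> >> > > >> > >         byte[] pattern = "sample kafka message payload ".getBytes();
>> >> > > >> > >         for (int i = 0; i < payload.length; i++)
>> >> > > >> > >             payload[i] = pattern[i % pattern.length];
>> >> > > >> > >     }
>> >> > > >> > >
>> >> > > >> > >     @Benchmark
>> >> > > >> > >     public byte[] gzip() throws Exception {
>> >> > > >> > >         ByteArrayOutputStream out = new ByteArrayOutputStream();
>> >> > > >> > >         try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
>> >> > > >> > >             gz.write(payload);
>> >> > > >> > >         }
>> >> > > >> > >         return out.toByteArray(); // return result to defeat dead-code elimination
>> >> > > >> > >     }
>> >> > > >> > >
>> >> > > >> > >     @Benchmark
>> >> > > >> > >     public byte[] zstd() {
>> >> > > >> > >         return Zstd.compress(payload, 3); // zstd's default level
>> >> > > >> > >     }
>> >> > > >> > > }
>> >> > > >> > >
>> >> > > >> > > If this direction looks reasonable, I can extend it to the other
>> >> > > >> > > codecs (and to decompression) and run it through JMH's standard
>> >> > > >> > > runner.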
>> >> > > >> > >
>> >> > > >> > > Regards,
>> >> > > >> > > Dongjin
>> >> > > >> > >
>> >> > > >> > >                 _____________________________
>> >> > > >> > > From: Ismael Juma <is...@juma.me.uk>
>> >> > > >> > > Sent: Wednesday, January 11, 2017 7:33 PM
>> >> > > >> > > Subject: Re: [DISCUSS] KIP-110: Add Codec for ZStandard
>> >> > Compression
>> >> > > >> > > To:  <de...@kafka.apache.org>
>> >> > > >> > >
>> >> > > >> > >
>> >> > > >> > > Thanks Dongjin. I highly recommend using JMH for the benchmark;
>> >> > > >> > > the existing one has a few problems that could result in
>> >> > > >> > > unreliable results. Also, it's a bit surprising that LZ4 is
>> >> > > >> > > producing smaller output than gzip. Is that right?
>> >> > > >> > >
>> >> > > >> > > Ismael
>> >> > > >> > >
>> >> > > >> > > On Wed, Jan 11, 2017 at 10:20 AM, Dongjin Lee <dongjin@apache.org>
>> >> > > >> > > wrote:
>> >> > > >> > >
>> >> > > >> > > > Ismael,
>> >> > > >> > > >
>> >> > > >> > > > I pushed the benchmark code I used, with some updates
>> >> > > >> > > > (iterations: 20 -> 1000). I also updated the KIP page with the
>> >> > > >> > > > new benchmark results. Please take a look when you are free.
>> >> > > >> > > > The attached screenshot shows how to run the benchmark.
>> >> > > >> > > >
>> >> > > >> > > > Thanks,
>> >> > > >> > > > Dongjin
>> >> > > >> > > >
>> >> > > >> > > > On Tue, Jan 10, 2017 at 8:03 PM, Dongjin Lee <dongjin@apache.org>
>> >> > > >> > > > wrote:
>> >> > > >> > > >
>> >> > > >> > > >> Ismael,
>> >> > > >> > > >>
>> >> > > >> > > >> I see. Then, I will share the benchmark code I used by
>> >> > > >> > > >> tomorrow. Thanks for your guidance.
>> >> > > >> > > >>
>> >> > > >> > > >> Best,
>> >> > > >> > > >> Dongjin
>> >> > > >> > > >>
>> >> > > >> > > >> -----
>> >> > > >> > > >>
>> >> > > >> > > >> Dongjin Lee
>> >> > > >> > > >>
>> >> > > >> > > >> Software developer in Line+.
>> >> > > >> > > >> So interested in massive-scale machine learning.
>> >> > > >> > > >>
>> >> > > >> > > >> facebook: www.facebook.com/dongjin.lee.kr
>> >> > > >> > > >> linkedin: kr.linkedin.com/in/dongjinleekr
>> >> > > >> > > >> github: github.com/dongjinleekr
>> >> > > >> > > >> twitter: www.twitter.com/dongjinleekr
>> >> > > >> > > >>
>> >> > > >> > > >>
>> >> > > >> > > >>
>> >> > > >> > > >>
>> >> > > >> > > >> On Tue, Jan 10, 2017 at 7:24 PM +0900, "Ismael Juma" <ismael@juma.me.uk>
>> >> > > >> > > >> wrote:
>> >> > > >> > > >>
>> >> > > >> > > >> Dongjin,
>> >> > > >> > > >>>
>> >> > > >> > > >>> The KIP states:
>> >> > > >> > > >>>
>> >> > > >> > > >>> "I compared the compressed size and compression time of
>> 3
>> >> > > >> 1kb-sized
>> >> > > >> > > >>> messages (3102 bytes in total), with the
>> >> Draft-implementation
>> >> > of
>> >> > > >> > > ZStandard
>> >> > > >> > > >>> Compression Codec and all currently available
>> >> > CompressionCodecs.
>> >> > > >> All
>> >> > > >> > > >>> elapsed times are the average of 20 trials."
>> >> > > >> > > >>>
>> >> > > >> > > >>> But it doesn't give any details of how this was implemented.
>> >> > > >> > > >>> Is the source code available somewhere? Micro-benchmarking in
>> >> > > >> > > >>> the JVM is pretty tricky, so it needs verification before the
>> >> > > >> > > >>> numbers can be trusted. A performance test with
>> >> > > >> > > >>> kafka-producer-perf-test.sh would be nice to have as well, if
>> >> > > >> > > >>> possible.
>> >> > > >> > > >>>
>> >> > > >> > > >>> Thanks,
>> >> > > >> > > >>> Ismael
>> >> > > >> > > >>>
>> >> > > >> > > >>> On Tue, Jan 10, 2017 at 7:44 AM, Dongjin Lee  wrote:
>> >> > > >> > > >>>
>> >> > > >> > > >>> > Ismael,
>> >> > > >> > > >>> >
>> >> > > >> > > >>> > 1. Is the benchmark on the KIP page not enough? Do you mean
>> >> > > >> > > >>> > we need a full performance test using
>> >> > > >> > > >>> > kafka-producer-perf-test.sh?
>> >> > > >> > > >>> >
>> >> > > >> > > >>> > 2. It seems that no major project is relying on it at the
>> >> > > >> > > >>> > moment. However, after reviewing the code, I concluded that
>> >> > > >> > > >>> > the project at least has good test coverage. As for tracking
>> >> > > >> > > >>> > upstream - although there have been no significant ZStandard
>> >> > > >> > > >>> > updates yet to judge this by, it does not look bad. If
>> >> > > >> > > >>> > required, I can take responsibility for tracking this
>> >> > > >> > > >>> > library.
>> >> > > >> > > >>> >
>> >> > > >> > > >>> > Thanks,
>> >> > > >> > > >>> > Dongjin
>> >> > > >> > > >>> >
>> >> > > >> > > >>> > On Tue, Jan 10, 2017 at 7:09 AM, Ismael Juma  wrote:
>> >> > > >> > > >>> >
>> >> > > >> > > >>> > > Thanks for posting the KIP. ZStandard looks like a nice
>> >> > > >> > > >>> > > improvement over the existing compression algorithms. A
>> >> > > >> > > >>> > > couple of questions:
>> >> > > >> > > >>> > >
>> >> > > >> > > >>> > > 1. Can you please elaborate on the details of the
>> >> > > >> > > >>> > > benchmark?
>> >> > > >> > > >>> > > 2. About https://github.com/luben/zstd-jni, can we rely on
>> >> > > >> > > >>> > > it? A few things to consider: are there other projects
>> >> > > >> > > >>> > > using it, does it have good test coverage, are there
>> >> > > >> > > >>> > > performance tests, does it track upstream closely?
>> >> > > >> > > >>> > >
>> >> > > >> > > >>> > > Thanks,
>> >> > > >> > > >>> > > Ismael
>> >> > > >> > > >>> > >
>> >> > > >> > > >>> > > On Fri, Jan 6, 2017 at 2:40 AM, Dongjin Lee  wrote:
>> >> > > >> > > >>> > >
>> >> > > >> > > >>> > > > Hi all,
>> >> > > >> > > >>> > > >
>> >> > > >> > > >>> > > > I've just posted a new KIP, "KIP-110: Add Codec for
>> >> > > >> > > >>> > > > ZStandard Compression", for discussion:
>> >> > > >> > > >>> > > >
>> >> > > >> > > >>> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-110%3A+Add+Codec+for+ZStandard+Compression
>> >> > > >> > > >>> > > >
>> >> > > >> > > >>> > > > Please have a look when you are free.
>> >> > > >> > > >>> > > >
>> >> > > >> > > >>> > > > Best,
>> >> > > >> > > >>> > > > Dongjin
>> >> > > >> > > >>> > > >
>> >> > > >> > > >>> > > >
>> >> > > >> > > >>> > >
>> >> > > >> > > >>> >
>> >> > > >> > > >>> >
>> >> > > >> > > >>> >
>> >> > > >> > > >>> >
>> >> > > >> > > >>>
>> >> > > >> > > >>>
>> >> > > >> > > >
>> >> > > >> > > >
>> >> > > >> > > >
>> >> > > >> > >
>> >> > > >> > >
>> >> > > >> > >
>> >> > > >> > >
>> >> > > >> > >
>> >> > > >> >
>> >> > > >>
>> >> > > >
>> >> > > >
>> >> > > >
>> >> > > >
>> >> > >
>> >> > >
>> >> > >
>> >> > >
>> >> >
>> >>
>> >
>> >
>> >
>> >
>>
>
>
>
> --
> *Dongjin Lee*
>
> Software developer in Line+.
> So interested in massive-scale machine learning.
>
> facebook: www.facebook.com/dongjin.lee.kr
> linkedin: kr.linkedin.com/in/dongjinleekr
> github: github.com/dongjinleekr
> twitter: www.twitter.com/dongjinleekr
>