Posted to dev@kafka.apache.org by Senthilnathan Muthusamy <se...@microsoft.com.INVALID> on 2019/10/21 17:43:00 UTC

[DISCUSS] KIP-280: Enhanced log compaction

Hi All,

We are bringing KIP-280 back to life with a small correction for discussion & voting. Thanks to the previous author, Luis Cabral, for initiating KIP-280; we are taking it over to complete it and get it into 2.4...

Below is the correction that we made to the existing KIP-280:

  *   Allowing the compaction strategy to be configured at the topic level, since log compaction operates per topic and a broker can host multiple topics. This provides the flexibility to set the strategy at both the broker level (i.e. for all topics on the broker) and the topic level (i.e. for a subset of topics on a broker); see the illustrative config sketch after the links below...

KIP-280: https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction
PULL REQUEST: https://github.com/apache/kafka/pull/7528 (unit test coverage in progress)

Previous Thread DISCUSS: https://lists.apache.org/thread.html/79aa6e50d7c737ddf83455dd8063692a535a1afa558620fe1a1496d3@%3Cdev.kafka.apache.org%3E
Previous Thread VOTE: https://lists.apache.org/thread.html/b2ecd73ce849741f0c40b4f801c3f7650583497812713e240e1ac2b7@%3Cdev.kafka.apache.org%3E
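
For illustration only, here is a minimal sketch of how the proposed topic-level override could be applied once the KIP lands. The config name `compaction.strategy` (topic-level override of the broker-level `log.cleaner.compaction.strategy` default) and the strategy value "header" are the names discussed in this thread, not configs that exist in any released Kafka version, and the topic name is made up:

    // Hypothetical sketch: set the proposed topic-level compaction strategy via the Admin API.
    // "compaction.strategy" and the value "header" are names proposed in this KIP discussion,
    // not configs that exist in released Kafka.
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class SetCompactionStrategy {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            try (Admin admin = Admin.create(props)) {
                ConfigResource topic =
                    new ConfigResource(ConfigResource.Type.TOPIC, "my-compacted-topic");
                AlterConfigOp setStrategy = new AlterConfigOp(
                    new ConfigEntry("compaction.strategy", "header"), AlterConfigOp.OpType.SET);
                admin.incrementalAlterConfigs(
                        Collections.singletonMap(topic, Collections.singletonList(setStrategy)))
                     .all().get();
            }
        }
    }

The broker-level default would sit in server.properties like any other `log.cleaner.*` setting, with the topic-level value taking precedence when both are set.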

Appreciate your timely action.

PS: Initiating a separate thread as I was not able to reply to the existing threads...

Thanks,
Senthil

RE: [DISCUSS] KIP-280: Enhanced log compaction

Posted by Senthilnathan Muthusamy <se...@microsoft.com.INVALID>.
Thanks for pointing that out, Eric. Updated the KIP...

Regards,
Senthil

-----Original Message-----
From: Guozhang Wang <wa...@gmail.com> 
Sent: Monday, November 4, 2019 11:52 AM
To: dev <de...@kafka.apache.org>
Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction

Eric,

I think that's a good point. In `Headers.java` we also designed the API to only have `Header lastHeader(String key);`. I think picking the last header for that key would make more sense, since internally headers are organized as a list such that newer headers can be considered as "overwriting" older ones.


Guozhang
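
A minimal sketch (not the cleaner code from the PR) of what the "pick the last header for the key" semantics could look like when extracting the value to compare during compaction; the long-encoded header value, the helper name, and the fall-back-to-offset rule are assumptions for illustration, based on the discussion above:

    import java.nio.ByteBuffer;
    import org.apache.kafka.common.header.Header;
    import org.apache.kafka.common.header.Headers;

    public class HeaderCompactionValue {
        // Sketch only: decide the "version" used to pick the winning record under a
        // header-based strategy, falling back to the record offset when the configured
        // header is absent or malformed.
        public static long compactionVersion(Headers headers, String strategyHeader, long recordOffset) {
            Header h = headers.lastHeader(strategyHeader);   // newest occurrence of the key wins
            if (h == null || h.value() == null || h.value().length != Long.BYTES) {
                return recordOffset;                         // fall back to offset-based compaction
            }
            return ByteBuffer.wrap(h.value()).getLong();     // e.g. a producer-assigned sequence
        }
    }

Using `lastHeader` means that if the key was added more than once, the newest occurrence wins, which matches the "overwriting" behaviour described above.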

On Mon, Nov 4, 2019 at 11:31 AM Eric Azama <ea...@gmail.com> wrote:

> Hi Senthilnathan,
>
> Regarding Matthias's point 6, what is the reasoning for choosing the 
> first occurrence of the configured header? I believe this corresponds 
> to the oldest value for a given key. If there are multiple values for a 
> key, it seems more intuitive that the newest value is the one that 
> should be used for compaction.
>
> Thanks,
> Eric
>
> On Mon, Nov 4, 2019 at 11:00 AM Guozhang Wang <wa...@gmail.com> wrote:
>
> > Hello Senthilnathan,
> >
> > Thanks for revamping on the KIP. I have only one comment about the 
> > wiki otherwise LGTM.
> >
> > 1. We should emphasize that the newly introduced config yields to 
> > the existing "log.cleanup.policy", i.e. if the latter's value is 
> > `delete` not `compact`, then the previous config would be ignored.
> >
> >
> > Guozhang
> >
> > On Mon, Nov 4, 2019 at 9:52 AM Senthilnathan Muthusamy 
> > <se...@microsoft.com.invalid> wrote:
> >
> > > Hi all,
> > >
> > > I will start the vote thread shortly for this updated KIP. If 
> > > there are any more thoughts I would love to hear them.
> > >
> > > Thanks,
> > > Senthil
> > >
> > > -----Original Message-----
> > > From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> > > Sent: Thursday, October 31, 2019 3:51 AM
> > > To: dev@kafka.apache.org
> > > Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> > >
> > > Hi Matthias
> > >
> > > Thanks for the response.
> > >
> > > (1) Yes
> > >
> > > (2) Yes, and the config name will be the same (i.e.
> > > `log.cleaner.compaction.strategy` &
> > > `log.cleaner.compaction.strategy.header`) at the broker level and the topic
> > > level (to override the broker-level default compaction strategy). Please let
> > > me know if we need to keep it in a different naming convention. Note: the
> > > broker-level configuration (which will be in server.properties) is optional
> > > and defaults to offset. The topic-level configuration will default to the
> > > broker-level config...
> > >
> > > (3) This new way avoids another config parameter, and also in future, if any
> > > new strategy like header needs additional info, no additional config is
> > > required. As this got discussed already and it was agreed to have a separate
> > > config, I will revert it. KIP updated...
> > >
> > > (4) Done
> > >
> > > (5) Updated
> > >
> > > (6) Updated to pick the first header in the list
> > >
> > > Please let me know if you have any other questions.
> > >
> > > Thanks,
> > > Senthil
> > >
> > > -----Original Message-----
> > > From: Matthias J. Sax <ma...@confluent.io>
> > > Sent: Thursday, October 31, 2019 12:13 AM
> > > To: dev@kafka.apache.org
> > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > >
> > > Thanks for picking up this KIP, Senthil.
> > >
> > > (1) As far as I remember, the main issue of the original proposal 
> > > was a missing topic level configuration for the compaction 
> > > strategy. With
> this
> > > being addressed, I am in favor of this KIP.
> > >
> > > (2) With regard to (1), it seems we would need a new topic level 
> > > config `compaction.strategy`, and 
> > > `log.cleaner.compaction.strategy` would be
> the
> > > default strategy (ie, broker level config) if a topic does not
> overwrite
> > it?
> > >
> > > (3) Why did you remove `log.cleaner.compaction.strategy.header`
> > > parameter and change the accepted values of 
> > > `log.cleaner.compaction.strategy` to "header.<key>" instead of 
> > > keeping "header"? The original approach seems to be cleaner, and I 
> > > think this
> was
> > > discussed on the original discuss thread already.
> > >
> > > (4) Nit: For the "timestamp" compaction strategy you changed the 
> > > KIP to
> > >
> > > -> `The record [create] timestamp`
> > >
> > > This is misleading IMHO, because it depends on the broker/log 
> > > configuration `(log.)message.timestamp.type` that can either be 
> > > `CreateTime` or `LogAppendTime` what the actual record timestamp 
> > > is. I would just remove "create" to keep it unspecified.
> > >
> > > (5) Nit: the section "Public Interfaces" should list the newly
> introduced
> > > configs -- configuration parameters are a public interface.
> > >
> > > (6) What do you mean by "first level header lookup"? The term 
> > > "first level" indicates some hierarchy, but headers don't have any 
> > > hierarchy
> --
> > > it's just a list of key-value pairs? If you mean the _order_ of 
> > > the headers, ie, pick the first header in the list that matches 
> > > the key,
> > please
> > > rephrase it to make it clearer.
> > >
> > >
> > >
> > > @Tom: I agree with all you are saying, however, I still think that 
> > > this KIP will improve the overall situation, because everything 
> > > you pointed
> > out
> > > is actually true with offset based compaction, too.
> > >
> > > The KIP is not a silver bullet that solves all issues for 
> > > interleaved writes, but I personally believe, it's a good improvement.
> > >
> > >
> > >
> > > -Matthias
> > >
> > >
> > > On 10/30/19 9:45 AM, Senthilnathan Muthusamy wrote:
> > > > Hi,
> > > >
> > > > Please let me know if anyone has any questions on this updated
> > KIP-280...
> > > >
> > > > Thanks,
> > > >
> > > > Senthil
> > > >
> > > > -----Original Message-----
> > > > From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> > > > Sent: Monday, October 28, 2019 11:36 PM
> > > > To: dev@kafka.apache.org
> > > > Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> > > >
> > > > Hi Tom,
> > > >
> > > > Sorry for the delayed response.
> > > >
> > > > Regarding the fall back to offset decision for both timestamp & header
> > > > value: it is based on the previous author's discussion
> > > > https://lists.apache.org/thread.html/f44317eb6cd34f91966654c80509d4a457dbbccdd02b86645782be67@%3Cdev.kafka.apache.org%3E
> > > > and as per that discussion, it is really required to avoid duplicates.
> > > >
> > > > And the timestamp strategy is from the original KIP author and we are
> > > > keeping it as is.
> > > >
> > > > Finally, on the sequence-order guarantee by the producer: it is not
> > > > feasible to wait for the ack in async / multi-thread / multi-process
> > > > scenarios, and hence the header-sequence-based compaction strategy, with the
> > > > producer's responsibility to generate a unique sequence at the
> > > > topic-partition-key level.
> > > >
> > > > Hoping this clarifies all your questions. Please let us know if you have
> > > > any further questions.
> > > >
> > > > @Guozhang Wang / @Matthias J. Sax, I see you both had a detailed
> > > > discussion on the original KIP with the previous author and it would be
> > > > great to hear your inputs as well.
> > > >
> > > > Thanks,
> > > > Senthil
> > > >
> > > > -----Original Message-----
> > > > From: Tom Bentley <tb...@redhat.com>
> > > > Sent: Tuesday, October 22, 2019 2:32 AM
> > > > To: dev@kafka.apache.org
> > > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > > >
> > > > Hi Senthilnathan,
> > > >
> > > > In the motivation isn't it a little misleading to say "On the producer
> > > > side, we clearly preserve an order for the two messages, <K1, V1> <K1,
> > > > V2>"? IMHO, the semantics of the producer are clear that having an observed
> > > > order of sending records from different producers is not sufficient to
> > > > guarantee ordering on the broker. You really need to send the 2nd record
> > > > only after the 1st record is acked. It's the difficulty of achieving that
> > > > in practice that's the true motivation for your KIP.
> > > >
> > > > I can see the attraction of using timestamps, but it would be 
> > > > helpful
> > to
> > > explain how that really solves the problem. When the producers are 
> > > in different processes on different machines you're relying on 
> > > their
> clocks
> > > being synchronized, which is a whole subject in itself. Even if 
> > > they're synchronized the resolution of System.currentTimeMillis() 
> > > is typically
> > many
> > > milliseconds. If your producers are in different threads of the 
> > > same process that could be a real problem because it makes ties 
> > > quite
> likely.
> > > > And you don't explain why it's OK to resolve ties using the offset.
> The
> > > basis of your argument is that the offset is giving you the wrong
> answer.
> > > > So it seems to me that using it as a tiebreaker is just 
> > > > narrowing the
> > > chances of getting the wrong answer. Maybe none of this matters 
> > > for
> your
> > > use case, but I think it should be spelled out in the KIP, because 
> > > it surely would matter for similar use cases.
> > > >
> > > > Using a sequence at least removes the problem of ties, but the
> > > interesting bit is now in how you deal with races between
> > threads/processes
> > > in getting a sequence number allocated (which is out of scope of 
> > > the
> > KIP, I
> > > guess).
> > > > How is resolving that race any simpler that resolving the 
> > > > motivating
> > > race by waiting for the ack of the first record sent?
> > > >
> > > > Kind regards,
> > > >
> > > > Tom
> > > >
> > > > On Mon, Oct 21, 2019 at 9:06 PM Senthilnathan Muthusamy <
> > > senthilm@microsoft.com.invalid> wrote:
> > > >
> > > >> Hi All,
> > > >>
> > > >> We are bring back the KIP-280 to live with small correct for 
> > > >> the discussion & voting. Thanks to previous author Luis Cabral 
> > > >> on the
> > > >> KIP-280 initiation and we are taking over to complete and get 
> > > >> it
> into
> > > 2.4...
> > > >>
> > > >> Below is the correction that we made to the existing KIP-280:
> > > >>
> > > >>   *   Allowing the compact strategy configuration at the topic level
> > as
> > > >> the log compaction is at the topic level and a broker can have 
> > > >> multiple topics. This allows the flexibility to have the 
> > > >> strategy at both broker level (i.e. for all topics within the 
> > > >> broker) and topic level (i.e. for a subset of topics within a broker) as well...
> > > >>
> > > >> KIP-280:
> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction
> > > >> PULL REQUEST: https://github.com/apache/kafka/pull/7528 (unit test coverage
> > > >> in progress)
> > > >>
> > > >> Previous Thread DISCUSS:
> > > >> https://lists.apache.org/thread.html/79aa6e50d7c737ddf83455dd8063692a535a1afa558620fe1a1496d3@%3Cdev.kafka.apache.org%3E
> > > >> Previous Thread VOTE:
> > > >> https://lists.apache.org/thread.html/b2ecd73ce849741f0c40b4f801c3f7650583497812713e240e1ac2b7@%3Cdev.kafka.apache.org%3E
> > > >>
> > > >> Appreciate your timely action.
> > > >>
> > > >> PS: Initiating a separate thread as I was not able to reply to 
> > > >> the existing threads...
> > > >>
> > > >> Thanks,
> > > >> Senthil
> > > >>
> > >
> > >
> >
> > --
> > -- Guozhang
> >
>


--
-- Guozhang

Re: [DISCUSS] KIP-280: Enhanced log compaction

Posted by "Matthias J. Sax" <ma...@confluent.io>.
Thanks for updating the KIP, Senthil.

@Eric: good point about using the last found header for the key instead
of the first!

I don't have any further comments at this point.


-Matthias
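
A small, self-contained illustration (not KIP code) of the `Headers` behaviour that motivates this choice: when the same key is added more than once, `lastHeader` returns the most recently added occurrence.

    import java.nio.charset.StandardCharsets;
    import org.apache.kafka.common.header.Header;
    import org.apache.kafka.common.header.Headers;
    import org.apache.kafka.common.header.internals.RecordHeaders;

    public class LastHeaderDemo {
        public static void main(String[] args) {
            Headers headers = new RecordHeaders();
            headers.add("version", "1".getBytes(StandardCharsets.UTF_8));
            headers.add("version", "2".getBytes(StandardCharsets.UTF_8)); // same key, added later
            Header last = headers.lastHeader("version");
            // Prints "2": lastHeader returns the most recently added occurrence of the key.
            System.out.println(new String(last.value(), StandardCharsets.UTF_8));
        }
    }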

On 11/5/19 11:37 AM, Senthilnathan Muthusamy wrote:
> Hi Guozhang,
> 
> Sure and I have made a note in the JIRA item to make sure the wiki is updated.
> 
> Thanks,
> Senthil
> 
> -----Original Message-----
> From: Guozhang Wang <wa...@gmail.com> 
> Sent: Monday, November 4, 2019 11:00 AM
> To: dev <de...@kafka.apache.org>
> Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> 
> Hello Senthilnathan,
> 
> Thanks for revamping on the KIP. I have only one comment about the wiki otherwise LGTM.
> 
> 1. We should emphasize that the newly introduced config yields to the existing "log.cleanup.policy", i.e. if the latter's value is `delete` not `compact`, then the previous config would be ignored.
> 
> 
> Guozhang
> 
> On Mon, Nov 4, 2019 at 9:52 AM Senthilnathan Muthusamy <se...@microsoft.com.invalid> wrote:
> 
>> Hi all,
>>
>> I will start the vote thread shortly for this updated KIP. If there 
>> are any more thoughts I would love to hear them.
>>
>> Thanks,
>> Senthil
>>
>> -----Original Message-----
>> From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
>> Sent: Thursday, October 31, 2019 3:51 AM
>> To: dev@kafka.apache.org
>> Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
>>
>> Hi Matthias
>>
>> Thanks for the response.
>>
>> (1) Yes
>>
>> (2) Yes, and the config name will be the same (i.e.
>> `log.cleaner.compaction.strategy` &
>> `log.cleaner.compaction.strategy.header`) at broker level and topic 
>> level (to override broker level default compact strategy). Please let 
>> me know if we need to keep it in different naming convention. Note: 
>> Broker level (which will be in the server.properties) configuration is 
>> optional and default it to offset. Topic level configuration will be 
>> default to broker level config...
>>
>> (3) By this new way, it avoids another config parameter and also in 
>> feature if any new strategy like header need addition info, no 
>> additional config required. As this got discussed already and agreed 
>> to have separate config, I will revert it. KIP updated...
>>
>> (4) Done
>>
>> (5) Updated
>>
>> (6) Updated to pick the first header in the list
>>
>> Please let me know if you have any other questions.
>>
>> Thanks,
>> Senthil
>>
>> -----Original Message-----
>> From: Matthias J. Sax <ma...@confluent.io>
>> Sent: Thursday, October 31, 2019 12:13 AM
>> To: dev@kafka.apache.org
>> Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
>>
>> Thanks for picking up this KIP, Senthil.
>>
>> (1) As far as I remember, the main issue of the original proposal was 
>> a missing topic level configuration for the compaction strategy. With 
>> this being addressed, I am in favor of this KIP.
>>
>> (2) With regard to (1), it seems we would need a new topic level 
>> config `compaction.strategy`, and `log.cleaner.compaction.strategy` 
>> would be the default strategy (ie, broker level config) if a topic does not overwrite it?
>>
>> (3) Why did you remove `log.cleaner.compaction.strategy.header`
>> parameter and change the accepted values of 
>> `log.cleaner.compaction.strategy` to "header.<key>" instead of keeping 
>> "header"? The original approach seems to be cleaner, and I think this 
>> was discussed on the original discuss thread already.
>>
>> (4) Nit: For the "timestamp" compaction strategy you changed the KIP 
>> to
>>
>> -> `The record [create] timestamp`
>>
>> This is misleading IMHO, because it depends on the broker/log 
>> configuration `(log.)message.timestamp.type` that can either be 
>> `CreateTime` or `LogAppendTime` what the actual record timestamp is. I 
>> would just remove "create" to keep it unspecified.
>>
>> (5) Nit: the section "Public Interfaces" should list the newly 
>> introduced configs -- configuration parameters are a public interface.
>>
>> (6) What do you mean by "first level header lookup"? The term "first 
>> level" indicates some hierarchy, but headers don't have any hierarchy 
>> -- it's just a list of key-value pairs? If you mean the _order_ of the 
>> headers, ie, pick the first header in the list that matches the key, 
>> please rephrase it to make it clearer.
>>
>>
>>
>> @Tom: I agree with all you are saying, however, I still think that 
>> this KIP will improve the overall situation, because everything you 
>> pointed out is actually true with offset based compaction, too.
>>
>> The KIP is not a silver bullet that solves all issues for interleaved 
>> writes, but I personally believe, it's a good improvement.
>>
>>
>>
>> -Matthias
>>
>>
>> On 10/30/19 9:45 AM, Senthilnathan Muthusamy wrote:
>>> Hi,
>>>
>>> Please let me know if anyone has any questions on this updated KIP-280...
>>>
>>> Thanks,
>>>
>>> Senthil
>>>
>>> -----Original Message-----
>>> From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
>>> Sent: Monday, October 28, 2019 11:36 PM
>>> To: dev@kafka.apache.org
>>> Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
>>>
>>> Hi Tom,
>>>
>>> Sorry for the delayed response.
>>>
>>> Regarding the fall back to offset decision for both timestamp & header
>>> value: it is based on the previous author's discussion
>>> https://lists.apache.org/thread.html/f44317eb6cd34f91966654c80509d4a457dbbccdd02b86645782be67@%3Cdev.kafka.apache.org%3E
>>> and as per that discussion, it is really required to avoid duplicates.
>>>
>>> And the timestamp strategy is from the original KIP author and we 
>>> are
>> keeping it as is.
>>>
>>> Finally on the sequence order guarantee by the producer, it is not
>> feasible on waiting for ack in async / multi-threads/processes 
>> scenarios and hence the header sequence based compact strategy with 
>> producer's responsibility to have a unique sequence generation for the 
>> topic-partition-key level.
>>>
>>> Hoping this clarifies all your questions. Please let us know if you 
>>> have
>> any further questions.
>>>
>>> @Guozhang Wang / @Matthias J. Sax, I see you both had a detail
>> discussion on the original KIP with previous author and it would great 
>> to hear your inputs as well.
>>>
>>> Thanks,
>>> Senthil
>>>
>>> -----Original Message-----
>>> From: Tom Bentley <tb...@redhat.com>
>>> Sent: Tuesday, October 22, 2019 2:32 AM
>>> To: dev@kafka.apache.org
>>> Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
>>>
>>> Hi Senthilnathan,
>>>
>>> In the motivation isn't it a little misleading to say "On the producer
>>> side, we clearly preserve an order for the two messages, <K1, V1> <K1,
>>> V2>"? IMHO, the semantics of the producer are clear that having an observed
>>> order of sending records from different producers is not sufficient to
>>> guarantee ordering on the broker. You really need to send the 2nd record
>>> only after the 1st record is acked. It's the difficulty of achieving that
>>> in practice that's the true motivation for your KIP.
>>>
>>> I can see the attraction of using timestamps, but it would be 
>>> helpful to
>> explain how that really solves the problem. When the producers are in 
>> different processes on different machines you're relying on their 
>> clocks being synchronized, which is a whole subject in itself. Even if 
>> they're synchronized the resolution of System.currentTimeMillis() is 
>> typically many milliseconds. If your producers are in different 
>> threads of the same process that could be a real problem because it makes ties quite likely.
>>> And you don't explain why it's OK to resolve ties using the offset. 
>>> The
>> basis of your argument is that the offset is giving you the wrong answer.
>>> So it seems to me that using it as a tiebreaker is just narrowing 
>>> the
>> chances of getting the wrong answer. Maybe none of this matters for 
>> your use case, but I think it should be spelled out in the KIP, 
>> because it surely would matter for similar use cases.
>>>
>>> Using a sequence at least removes the problem of ties, but the
>> interesting bit is now in how you deal with races between 
>> threads/processes in getting a sequence number allocated (which is out 
>> of scope of the KIP, I guess).
>>> How is resolving that race any simpler that resolving the motivating
>> race by waiting for the ack of the first record sent?
>>>
>>> Kind regards,
>>>
>>> Tom
>>>
>>> On Mon, Oct 21, 2019 at 9:06 PM Senthilnathan Muthusamy <
>> senthilm@microsoft.com.invalid> wrote:
>>>
>>>> Hi All,
>>>>
>>>> We are bring back the KIP-280 to live with small correct for the 
>>>> discussion & voting. Thanks to previous author Luis Cabral on the
>>>> KIP-280 initiation and we are taking over to complete and get it 
>>>> into
>> 2.4...
>>>>
>>>> Below is the correction that we made to the existing KIP-280:
>>>>
>>>>   *   Allowing the compact strategy configuration at the topic level as
>>>> the log compaction is at the topic level and a broker can have 
>>>> multiple topics. This allows the flexibility to have the strategy 
>>>> at both broker level (i.e. for all topics within the broker) and 
>>>> topic level (i.e. for a subset of topics within a broker) as well...
>>>>
>>>> KIP-280:
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction
>>>> PULL REQUEST: https://github.com/apache/kafka/pull/7528 (unit test coverage in
>>>> progress)
>>>>
>>>> Previous Thread DISCUSS:
>>>> https://lists.apache.org/thread.html/79aa6e50d7c737ddf83455dd8063692a535a1afa558620fe1a1496d3@%3Cdev.kafka.apache.org%3E
>>>> Previous Thread VOTE:
>>>> https://lists.apache.org/thread.html/b2ecd73ce849741f0c40b4f801c3f7650583497812713e240e1ac2b7@%3Cdev.kafka.apache.org%3E
>>>>
>>>> Appreciate your timely action.
>>>>
>>>> PS: Initiating a separate thread as I was not able to reply to the 
>>>> existing threads...
>>>>
>>>> Thanks,
>>>> Senthil
>>>>
>>
>>
> 
> --
> -- Guozhang
> 


Re: [EXTERNAL] Re: [DISCUSS] KIP-280: Enhanced log compaction

Posted by Guozhang Wang <wa...@gmail.com>.
Hi Radai,

This is an interesting idea indeed. However, since this KIP has already been
voted on, I'd suggest we discuss it in a separate thread so that we do not drag
this one out (as you can tell from the KIP number, it has been proposed for
quite some time, and it would be great to have the feature get in first and then
consider further improvements later).

As for handling large messages, please correct me if I'm thinking about this
wrong (my main knowledge about it comes from
https://www.slideshare.net/JiangjieQin/handle-large-messages-in-apache-kafka-58692297).
If we consider (or have already done so) leveraging headers to split /
stitch segments of a large message, could we still use different values to
indicate the sequence of the segments? E.g. suppose the header keys are all
the same; the header values could still be "m1-s1" (message one, segment
one), with the last segment of m1 being "m1-s5d" (message one, segment five,
and it is the end segment), etc.


Guozhang
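
A rough sketch of the producer-side fragmentation described above, under the assumption that a payload larger than the max message size is split into segments that all carry the same header key, with per-segment values like "m1-s1" .. "m1-s5d". The topic name, header key, and segment encoding are illustrative only; the consumer-side stitching and the compaction semantics for such fragments are exactly what a separate discussion would need to settle:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.ByteArraySerializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class FragmentedSend {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", ByteArraySerializer.class.getName());

            byte[] largePayload = new byte[5 * 1024 * 1024]; // pretend this exceeds the max message size
            int segmentSize = 1024 * 1024;
            int segments = (largePayload.length + segmentSize - 1) / segmentSize;

            try (Producer<String, byte[]> producer = new KafkaProducer<>(props)) {
                for (int s = 0; s < segments; s++) {
                    int from = s * segmentSize;
                    int to = Math.min(from + segmentSize, largePayload.length);
                    ProducerRecord<String, byte[]> record = new ProducerRecord<>(
                        "large-events", "order-42", Arrays.copyOfRange(largePayload, from, to));
                    // Every segment of one logical message carries the same header key; the value
                    // encodes the segment sequence, e.g. "m1-s1" .. "m1-s5d" (last segment).
                    String seq = "m1-s" + (s + 1) + (s == segments - 1 ? "d" : "");
                    record.headers().add("fragment", seq.getBytes(StandardCharsets.UTF_8));
                    producer.send(record);
                }
            }
        }
    }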

On Mon, Dec 16, 2019 at 9:36 AM Senthilnathan Muthusamy
<se...@microsoft.com.invalid> wrote:

> Hi Radai
>
> Thanks for the suggestion. This is a really cool feature and a specific
> scenario for handling fragments... However, I would strongly recommend coming
> up with a separate KIP to discuss this scenario so that we will have a better
> design in place, and also to not divert the intent of the current KIP...
>
> Appreciate your valuable feedback!
>
> Regards,
> Senthil
>
> -----Original Message-----
> From: radai <ra...@gmail.com>
> Sent: Thursday, December 12, 2019 11:40 AM
> To: dev@kafka.apache.org
> Subject: Re: [EXTERNAL] Re: [DISCUSS] KIP-280: Enhanced log compaction
>
> may I suggest that if, under "header" strategy, multiple records are found
> with identical header values they are ALL kept?
> this would be useful in cases where users send larger payloads than max
> record size to kafka and are forced to fragment them - by setting the same
> header in all fragments it would become possible to properly log-compact
> topics with such fragmented payloads.
>
> On Tue, Nov 26, 2019 at 10:24 PM Senthilnathan Muthusamy <
> senthilm@microsoft.com.invalid> wrote:
> >
> > Thanks Jun for confirming!
> >
> > I have updated the KIP (added recommendation section and special case in
> handling LEO record for non-offset based compaction strategy). Please
> review and let me know if you have any other feedback.
> >
> > Regards,
> > Senthil
> >
> > -----Original Message-----
> > From: Jun Rao <ju...@confluent.io>
> > Sent: Tuesday, November 26, 2019 4:36 PM
> > To: dev <de...@kafka.apache.org>
> > Subject: [EXTERNAL] Re: [DISCUSS] KIP-280: Enhanced log compaction
> >
> > Hi, Senthil,
> >
> > Sorry for the delay.
> >
> > 51. It seems that we can just remove the last record from the batch, but
> > keep the batch during compaction. The batch-level metadata is enough to
> > preserve the log end offset.
> >
> > 53. Yes, your understanding is correct. So we could recommend users to
> set "
> > max.compaction.lag.ms" properly if they care about deletes.
> >
> > Could you add both to the KIP?
> >
> > Thanks,
> >
> > Jun
> >
> >
> > On Tue, Nov 26, 2019 at 5:09 AM Senthilnathan Muthusamy <
> senthilm@microsoft.com.invalid> wrote:
> >
> > > Hi Guozhang & Jun,
> > >
> > > Can one of you please confirm/respond to the below mail so that I
> > > will go ahead and update the KIP and proceed.
> > >
> > > Thanks
> > > Senthil
> > >
> > > - Senthil
> > > ________________________________
> > > From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> > > Sent: Wednesday, November 20, 2019 5:04:20 PM
> > > To: dev@kafka.apache.org <de...@kafka.apache.org>
> > > Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> > >
> > > <merging threads>
> > >
> > > Hi Guozhang & Jun,
> > >
> > > Thanks for the detailed on the scenarios.
> > >
> > > #51 => thanks for the detailed example, Guozhang. Won't followers also be
> > > syncing the LEO with the leader? If yes, always keeping the last record
> > > (without compaction for non-offset strategies) would work, and this is needed
> > > only if the new strategy ends up removing the LEO record, right? Also, I
> > > couldn't retrieve Jason's mail related to creating an empty message... Can you
> > > please forward it if you have it? Wondering how that can solve this particular
> > > issue unless we create a record for a random key that won't conflict with the
> > > producer/consumer keys for that topic/partition.
> > >
> > > #53 => I see that this can happen when a low produce rate keeps the log
> > > ineligible for compaction for an unbounded duration, whereby
> > > "delete.retention.ms" triggers and removes the tombstone record. If
> > > that's the case (please correct me if I am missing any other
> > > scenarios), then we can suggest that Kafka users set "segment.ms" &
> > > "max.compaction.lag.ms" (as compaction won't happen on the active
> > > segment) to be smaller than "delete.retention.ms", and that
> > > should address this scenario, right?
> > >
> > > Thanks,
> > > Senthil
> > >
> > > -----Original Message-----
> > > From: Jun Rao <ju...@confluent.io>
> > > Sent: Wednesday, November 13, 2019 9:31 AM
> > > To: dev <de...@kafka.apache.org>
> > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > >
> > > Hi, Seth,
> > >
> > > 51. The difference is that with the offset compaction strategy, the
> > > message corresponding to the last offset is always the winning
> > > record and will never be removed. But with the new strategies, it's
> > > now possible that the message corresponding to the last offset is a
> > > losing record and needs to be removed.
> > >
> > > 53. Similarly, with the offset compaction strategy, if we see a
> > > non-tombstone record after a tombstone record, the non-tombstone
> > > record is always the winning one. However, with the new strategies,
> > > that non-tombstone record with a larger offset could be a losing
> > > record. The question is then how do we retain the tombstone long
> > > enough so that we could still recognize that the non-tombstone record
> should be ignored.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > -----Original Message-----
> > > From: Guozhang Wang <wa...@gmail.com>
> > > Sent: Tuesday, November 12, 2019 6:09 PM
> > > To: dev <de...@kafka.apache.org>
> > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > >
> > > Hello Senthil,
> > >
> > > Let me try to re-iterate on Jun's comments with some context here:
> > >
> > > 51: today with the offset-only compaction strategy, the last record
> > > of the log (we call it the log-end-record, whose offset is
> > > log-end-offset) would always be preserved and not compacted. This is
> > > kinda important for replication since followers reason about the
> log-end-offset on the leader.
> > > Consider this case: three replicas of a partition, leader 1 and
> > > follower 2 and 3.
> > >
> > > Leader 1 has records a, b, c, d and d is the current last record of
> > > the partition, the current log-end-offset is 3 (assuming record a's
> > > offset is 0).
> > > Follower 2 has replicated a, b, c, d. Log-end-offset is 3 Follower 3
> > > has replicated a, b, c but not yet replicated d. Log-end-offset is 2.
> > >
> > > NOTE that compaction triggering is independent across brokers, so it
> > > is possible that leader 1 triggers compaction and deletes record d,
> > > while other followers have not triggered compaction yet. At this
> > > moment the leader's log becomes a, b, c. Now let's say follower 3
> > > fetches from the leader after the compaction; it will no longer see record d.
> > >
> > > Now suppose there's a leader migration and follower 3 becomes the
> > > new leader, it would accept new appends (say, it's e), and record e
> > > would be appended at *offset 3 *on new leader 3's log. But follower
> > > 2's offset 3's record is d still. Later let's say follower 2 also
> > > triggers compaction and also fetches the new record e from new leader
> 3:
> > >
> > > Follower 2's log would be* a(0), b(1), c(2), e(4)* where the numbers
> > > in brackets are offset number; while leader 3's log would be *a(0),
> > > b(1), c(2), e(3)*. Now you see the two logs diverges in offsets,
> > > although their log entries are the same.
> > >
> > > -------------------------------------
> > >
> > > One way to resolve this, is to simply never remove the last message
> > > during compaction. Another way (suggested by Jason in the old VOTE
> > > thread) is to create an empty message batch to "take up" that offset
> slot.
> > >
> > >
> > > 53: Again here's some context on when we can delete a tombstone (null):
> > > during compaction, if we see the latest record for a certain key is
> > > a tombstone we can remove all old records BUT that tombstone itself
> > > cannot be removed immediately since the old records may already be
> > > fetched by some consumers and that tombstone may not be fetched by
> > > consumer yet. Also that tombstone may have not been replicated to
> > > all other followers yet while the old records have already been
> > > replicated. Hence we have some config on the broker to "delay" the
> > > removal of the tombstone itself. You can find this config named
> > > "delete.retention.ms" in
> > >
> > > https://kafka.apache.org/documentation/#brokerconfigs
> > >
> > > Now consider under timestamp / header based compaction strategy: a
> > > later record may still be deprecated by an early tombstone, so if
> > > that tombstone is already removed then the log compaction thread
> > > would not remove that later record and hence the logic would be
> > > broken. That's why we also need to consider "delaying" the removal of the
> tombstone in this case.
> > >
> > > Personally I think we can still piggy-back on the "delete.retention.ms
> "
> > > since its default value is 86400000ms == 1 day, and we just need to
> > > document that if you have timestamp / header based compaction, then
> > > it's YOUR responsibility as the Kafka user to make sure that the
> > > timestamp / header out-of-order window is smaller than the value of "
> delete.retention.ms".
> > > Otherwise some later records with smaller timestamp / headers may
> > > not be compacted correctly since the tombstone is already gone and
> > > hence we do not have the "proof" to remove it anymore.
> > >
> > >
> > > Does that make sense to you?
> > >
> > > Guozhang
> > >
> > >
> > > On Tue, Nov 12, 2019 at 9:15 AM Senthilnathan Muthusamy <
> > > senthilm@microsoft.com.invalid> wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > Thanks for the response and please find below the response!
> > > >
> > > > #50 - got it...
> > > >
> > > > #51 - not sure how the last record will be deleted because of this
> > > > new compact strategy. The reason I am asking is, the compaction is
> > > > based out of offsetmap and the new strategy logic is purely within
> > > > the offsetmap... the offsetmap will always keep track of the
> > > > latest offset irrespective of the compaction strategy. You can
> > > > have a look at the PR of the new compaction strategy changes:
> > > > https://github.com/apache/kafka/pull/7528/files
> > > >
> > > > #52 - sure, I have updated JIRA to include this details in the wiki.
> > > >
> > > > #53 - as I pointed out in #51, the tombstone is abstract to
> > > > this change (i.e. the tombstone is handled within LogCleaner and
> > > > the compact strategy is within the offsetmap). This is my
> > > > understanding of the tombstone based on the code walk-thru... please
> > > > let me know if I am missing anything here...
> > > >
> > > > Thanks,
> > > > Senthil
> > > >
> > > > -----Original Message-----
> > > > From: Jun Rao <ju...@confluent.io>
> > > > Sent: Thursday, November 7, 2019 4:32 PM
> > > > To: dev <de...@kafka.apache.org>
> > > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > > >
> > > > Hi, Senthil,
> > > >
> > > > Thanks for bringing back this KIP. Overall, this seems like a
> > > > useful feature. A few comments below.
> > > >
> > > > 50. One use case for the timestamp based compaction is to resolve
> > > > conflicts during data center failures. The failover of a data
> > > > center typically takes much longer than a millisecond. So, the timestamp
> > > > could be enough to determine the value to keep.
> > > >
> > > > 51. With the timestamp/header strategy, it seems that it may now
> > > > be possible that the last record could be removed during compaction.
> > > > For example, if the active segment is empty, the last record in
> > > > the previous segment could be removed due to compaction. A new
> > > > replica then won't see the true end offset of the partition. If
> > > > that replica ever becomes the leader, it could write a different
> > > > record on the same end offset, which will be weird.
> > > >
> > > > 52. With the timestamp/header strategy, the behavior of the
> > > > application may need to change. In particular, the application
> > > > can't just blindly take the record with a larger offset and
> > > > assume that it's
> > > the value to keep.
> > > > It needs to check the timestamp or the header now. So, it would be
> > > > useful to at least document this.
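
A minimal Java sketch of the application-side change described in point 52, assuming the timestamp strategy: the consumer keeps the record with the larger timestamp per key instead of blindly trusting the larger offset. The class and field names are made up for illustration, and the offset tie-break follows the fallback-to-offset behavior discussed in this thread.

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.ConsumerRecord;

    // Tracks the "winning" record per key under the timestamp strategy.
    final class LatestByTimestamp {
        private final Map<String, ConsumerRecord<String, String>> latest = new HashMap<>();

        void maybeUpdate(ConsumerRecord<String, String> record) {
            ConsumerRecord<String, String> current = latest.get(record.key());
            // Larger timestamp wins; on a tie, the larger offset wins.
            if (current == null
                    || record.timestamp() > current.timestamp()
                    || (record.timestamp() == current.timestamp() && record.offset() > current.offset())) {
                latest.put(record.key(), record);
            }
        }
    }
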
> > > >
> > > > 53. This also adds complexity for deletes. Currently, we use a
> > > > null payload to indicate a delete tombstone. The tombstone can be
> > > > removed once all previous records with the same key have been
> > > > removed. If the new strategies apply to tombstones, it's not clear
> > > > when a tombstone can be removed since subsequent records could
> > > > have timestamp/sequenceId smaller than that in the tombstone. It
> > > > would be useful to think this through and document the expected
> behavior.
> > > >
> > > > Jun
> > > >
> > > > On Tue, Nov 5, 2019 at 11:37 AM Senthilnathan Muthusamy <
> > > > senthilm@microsoft.com.invalid> wrote:
> > > >
> > > > > Hi Guozhang,
> > > > >
> > > > > Sure and I have made a note in the JIRA item to make sure the
> > > > > wiki is updated.
> > > > >
> > > > > Thanks,
> > > > > Senthil
> > > > >
> > > > > -----Original Message-----
> > > > > From: Guozhang Wang <wa...@gmail.com>
> > > > > Sent: Monday, November 4, 2019 11:00 AM
> > > > > To: dev <de...@kafka.apache.org>
> > > > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > > > >
> > > > > Hello Senthilnathan,
> > > > >
> > > > > Thanks for revamping on the KIP. I have only one comment about
> > > > > the wiki otherwise LGTM.
> > > > >
> > > > > 1. We should emphasize that the newly introduced config yields
> > > > > to the existing "log.cleanup.policy", i.e. if the latter's value
> > > > > is `delete` not `compact`, then the previous config would be
> ignored.
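
To make that precedence concrete, a sketch of how the settings could look together; "log.cleanup.policy" is an existing broker config, while the two strategy configs are the names proposed in this KIP (not yet existing configs) and "version" is a made-up header key:

    log.cleanup.policy=compact                      # must be compact for any strategy to take effect
    log.cleaner.compaction.strategy=header          # proposed in KIP-280; defaults to "offset"
    log.cleaner.compaction.strategy.header=version  # proposed in KIP-280; header key used for comparison
    # with log.cleanup.policy=delete, the two strategy settings above are ignored
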
> > > > >
> > > > >
> > > > > Guozhang
> > > > >
> > > > > On Mon, Nov 4, 2019 at 9:52 AM Senthilnathan Muthusamy <
> > > > > senthilm@microsoft.com.invalid> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I will start the vote thread shortly for this updated KIP. If
> > > > > > there are any more thoughts I would love to hear them.
> > > > > >
> > > > > > Thanks,
> > > > > > Senthil
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> > > > > > Sent: Thursday, October 31, 2019 3:51 AM
> > > > > > To: dev@kafka.apache.org
> > > > > > Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> > > > > >
> > > > > > Hi Matthias
> > > > > >
> > > > > > Thanks for the response.
> > > > > >
> > > > > > (1) Yes
> > > > > >
> > > > > > (2) Yes, and the config name will be the same (i.e.
> > > > > > `log.cleaner.compaction.strategy` &
> > > > > > `log.cleaner.compaction.strategy.header`) at broker level and
> > > > > > topic level (to override the broker-level default compact strategy).
> > > > > > Please let me know if we need to keep it in a different naming
> > > > > > convention. Note:
> > > > > > Broker-level configuration (which will be in server.properties)
> > > > > > is optional and defaults to offset. Topic-level configuration
> > > > > > will default to the broker-level config...
> > > > > >
> > > > > > (3) This new way avoids another config parameter, and also, in the
> > > > > > future, if any new strategy like header needs additional
> > > > > > info, no additional config is required. As this got discussed
> > > > > > already and it was agreed to have a separate config, I will revert it.
> > > > > > KIP updated...
> > > > > >
> > > > > > (4) Done
> > > > > >
> > > > > > (5) Updated
> > > > > >
> > > > > > (6) Updated to pick the first header in the list
> > > > > >
> > > > > > Please let me know if you have any other questions.
> > > > > >
> > > > > > Thanks,
> > > > > > Senthil
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Matthias J. Sax <ma...@confluent.io>
> > > > > > Sent: Thursday, October 31, 2019 12:13 AM
> > > > > > To: dev@kafka.apache.org
> > > > > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > > > > >
> > > > > > Thanks for picking up this KIP, Senthil.
> > > > > >
> > > > > > (1) As far as I remember, the main issue of the original
> > > > > > proposal was a missing topic level configuration for the
> compaction strategy.
> > > > > > With this being addressed, I am in favor of this KIP.
> > > > > >
> > > > > > (2) With regard to (1), it seems we would need a new topic
> > > > > > level config `compaction.strategy`, and
> > > > > > `log.cleaner.compaction.strategy` would be the default
> > > > > > strategy (ie, broker level config) if a topic does
> > > > > not overwrite it?
> > > > > >
> > > > > > (3) Why did you remove
> > > > > > `log.cleaner.compaction.strategy.header`
> > > > > > parameter and change the accepted values of
> > > > > > `log.cleaner.compaction.strategy` to "header.<key>" instead of
> > > > > > keeping "header"? The original approach seems to be cleaner,
> > > > > > and I think this was discussed on the original discuss thread
> already.
> > > > > >
> > > > > > (4) Nit: For the "timestamp" compaction strategy you changed
> > > > > > the KIP to
> > > > > >
> > > > > > -> `The record [create] timestamp`
> > > > > >
> > > > > > This is misleading IMHO, because it depends on the
> > > > > > broker/log configuration `(log.)message.timestamp.type` that
> > > > > > can either be `CreateTime` or `LogAppendTime` what the actual
> record timestamp is.
> > > > > > I would just remove "create" to keep it unspecified.
> > > > > >
> > > > > > (5) Nit: the section "Public Interfaces" should list the newly
> > > > > > introduced configs -- configuration parameters are a public
> > > interface.
> > > > > >
> > > > > > (6) What do you mean by "first level header lookup"? The term
> > > > > > "first level" indicates some hierarchy, but headers don't have
> > > > > > any hierarchy
> > > > > > -- it's just a list of key-value pairs? If you mean the
> > > > > > _order_ of the headers, ie, pick the first header in the list
> > > > > > that matches the key, please rephrase it to make it clearer.
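
A minimal Java sketch of the "pick the first header in the list that matches the key" reading discussed in (6), using the public Headers API; the helper class name and the long encoding of the header value are assumptions for illustration only:

    import java.nio.ByteBuffer;
    import org.apache.kafka.common.header.Header;
    import org.apache.kafka.common.header.Headers;

    final class CompactionHeaderValue {
        // Returns the value of the first header matching the configured key,
        // or null so the caller can fall back to offset-based comparison.
        static Long firstValue(Headers headers, String configuredKey) {
            for (Header header : headers.headers(configuredKey)) {
                byte[] value = header.value();
                return (value != null && value.length == Long.BYTES)
                        ? ByteBuffer.wrap(value).getLong()
                        : null;
            }
            return null;
        }
    }
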
> > > > > >
> > > > > >
> > > > > >
> > > > > > @Tom: I agree with all you are saying, however, I still think
> > > > > > that this KIP will improve the overall situation, because
> > > > > > everything you pointed out is actually true with offset based
> compaction, too.
> > > > > >
> > > > > > The KIP is not a silver bullet that solves all issue for
> > > > > > interleaved writes, but I personally believe, it's a good
> > > improvement.
> > > > > >
> > > > > >
> > > > > >
> > > > > > -Matthias
> > > > > >
> > > > > >
> > > > > > On 10/30/19 9:45 AM, Senthilnathan Muthusamy wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > Please let me know if anyone has any questions on this
> > > > > > > updated
> > > > > KIP-280...
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Senthil
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Senthilnathan Muthusamy
> > > > > > > <se...@microsoft.com.INVALID>
> > > > > > > Sent: Monday, October 28, 2019 11:36 PM
> > > > > > > To: dev@kafka.apache.org
> > > > > > > Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> > > > > > >
> > > > > > > Hi Tom,
> > > > > > >
> > > > > > > Sorry for the delayed response.
> > > > > > >
> > > > > > > Regarding the fallback-to-offset decision for both the timestamp &
> > > > > > > header value: it is based on the previous author's discussion at
> > > > > > > https://lists.apache.org/thread.html/f44317eb6cd34f91966654c80509d4a457dbbccdd02b86645782be67@%3Cdev.kafka.apache.org%3E
> > > > > > > and, as per that discussion, it is really required to avoid duplicates.
> > > > > > >
> > > > > > > And the timestamp strategy is from the original KIP author
> > > > > > > and we are
> > > > > > keeping it as is.
> > > > > > >
> > > > > > > Finally, on the sequence order guarantee by the producer: it is
> > > > > > > not feasible to wait for the ack in async / multi-thread/process
> > > > > > > scenarios, hence the header-sequence-based compact strategy, with
> > > > > > > the producer's responsibility being to generate a unique sequence
> > > > > > > at the topic-partition-key level.
> > > > > > >
> > > > > > > Hoping this clarifies all your questions. Please let us know
> > > > > > > if you have
> > > > > > any further questions.
> > > > > > >
> > > > > > > @Guozhang Wang / @Matthias J. Sax, I see you both had a
> > > > > > > detail
> > > > > > discussion on the original KIP with previous author and it
> > > > > > would be great to hear your inputs as well.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Senthil
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Tom Bentley <tb...@redhat.com>
> > > > > > > Sent: Tuesday, October 22, 2019 2:32 AM
> > > > > > > To: dev@kafka.apache.org
> > > > > > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > > > > > >
> > > > > > > Hi Senthilnathan,
> > > > > > >
> > > > > > > In the motivation isn't it a little misleading to say "On
> > > > > > > the producer side, we clearly preserve an order for the two
> > > > > > > messages, <K1, V1> <K1,
> > > > > > > V2>"? IMHO, the semantics of the producer are clear that
> > > > > > > having an observed
> > > > > > > order of sending records from different producers is not
> > > > > > > sufficient to
> > > > > > guarantee ordering on the broker. You really need to send the
> > > > > > 2nd record only after the 1st record is acked. It's the
> > > > > > difficulty of achieving that in practice that's the true
> motivation for your KIP.
> > > > > > >
> > > > > > > I can see the attraction of using timestamps, but it would
> > > > > > > be helpful to
> > > > > > explain how that really solves the problem. When the producers
> > > > > > are in different processes on different machines you're
> > > > > > relying on their clocks being synchronized, which is a whole
> > > > > > subject in itself. Even if they're synchronized the resolution
> > > > > > of
> > > > > > System.currentTimeMillis() is typically many milliseconds. If
> > > > > > your producers are in different threads of the same process
> > > > > > that could be a real problem because it
> > > > > makes ties quite likely.
> > > > > > > And you don't explain why it's OK to resolve ties using the
> offset.
> > > > > > > The
> > > > > > basis of your argument is that the offset is giving you the
> > > > > > wrong
> > > > answer.
> > > > > > > So it seems to me that using it as a tiebreaker is just
> > > > > > > narrowing the
> > > > > > chances of getting the wrong answer. Maybe none of this
> > > > > > matters for your use case, but I think it should be spelled
> > > > > > out in the KIP, because it surely would matter for similar use
> cases.
> > > > > > >
> > > > > > > Using a sequence at least removes the problem of ties, but
> > > > > > > the
> > > > > > interesting bit is now in how you deal with races between
> > > > > > threads/processes in getting a sequence number allocated
> > > > > > (which is out of scope of the KIP, I guess).
> > > > > > > How is resolving that race any simpler that resolving the
> > > > > > > motivating
> > > > > > race by waiting for the ack of the first record sent?
> > > > > > >
> > > > > > > Kind regards,
> > > > > > >
> > > > > > > Tom
> > > > > > >
> > > > > > > On Mon, Oct 21, 2019 at 9:06 PM Senthilnathan Muthusamy <
> > > > > > senthilm@microsoft.com.invalid> wrote:
> > > > > > >
> > > > > > >> Hi All,
> > > > > > >>
> > > > > > >> We are bring back the KIP-280 to live with small correct
> > > > > > >> for the discussion & voting. Thanks to previous author Luis
> > > > > > >> Cabral on the
> > > > > > >> KIP-280 initiation and we are taking over to complete and
> > > > > > >> get it into
> > > > > > 2.4...
> > > > > > >>
> > > > > > >> Below is the correction that we made to the existing KIP-280:
> > > > > > >>
> > > > > > >>   *   Allowing the compact strategy configuration at the topic
> > > level
> > > > > as
> > > > > > >> the log compaction is at the topic level and a broker can
> > > > > > >> have multiple topics. This allows the flexibility to have
> > > > > > >> the strategy at both broker level (i.e. for all topics
> > > > > > >> within the
> > > > > > >> broker) and topic level (i.e. for a subset of topics within
> > > > > > >> a
> > > broker) as well...
> > > > > > >>
> > > > > > >> KIP-280:
> > > > > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction
> > > > > > >> PULL REQUEST: https://github.com/apache/kafka/pull/7528 (unit test coverage in progress)
> > > > > > >>
> > > > > > >> Previous Thread DISCUSS:
> > > > > > >> https://lists.apache.org/thread.html/79aa6e50d7c737ddf83455dd8063692a535a1afa558620fe1a1496d3@%3Cdev.kafka.apache.org%3E
> > > > > > >> Previous Thread VOTE:
> > > > > > >> https://lists.apache.org/thread.html/b2ecd73ce849741f0c40b4f801c3f7650583497812713e240e1ac2b7@%3Cdev.kafka.apache.org%3E
> > > > > > >>
> > > > > > >> Appreciate your timely action.
> > > > > > >>
> > > > > > >> PS: Initiating a separate thread as I was not able to reply
> > > > > > >> to the existing threads...
> > > > > > >>
> > > > > > >> Thanks,
> > > > > > >> Senthil
> > > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > > > --
> > > > > -- Guozhang
> > > > >
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
>


-- 
-- Guozhang

RE: [EXTERNAL] Re: [DISCUSS] KIP-280: Enhanced log compaction

Posted by Senthilnathan Muthusamy <se...@microsoft.com.INVALID>.
Hi Radai

Thanks for the suggestion. This is a really cool feature for a specific scenario of handling fragments... However, I would strongly recommend coming up with a separate KIP to discuss this scenario so that we can have a better design in place, and also so as not to divert the intent of the current KIP...

Appreciate your valuable feedback!

Regards,
Senthil

-----Original Message-----
From: radai <ra...@gmail.com> 
Sent: Thursday, December 12, 2019 11:40 AM
To: dev@kafka.apache.org
Subject: Re: [EXTERNAL] Re: [DISCUSS] KIP-280: Enhanced log compaction

may I suggest that if, under "header" strategy, multiple records are found with identical header values they are ALL kept?
this would be useful in cases where users send larger payloads than max record size to kafka and are forced to fragment them - by setting the same header in all fragments it would become possible to properly log-compact topics with such fragmented payloads.

On Tue, Nov 26, 2019 at 10:24 PM Senthilnathan Muthusamy <se...@microsoft.com.invalid> wrote:
>
> Thanks Jun for confirming!
>
> I have updated the KIP (added recommendation section and special case in handling LEO record for non-offset based compaction strategy). Please review and let me know if you have any other feedback.
>
> Regards,
> Senthil
>
> -----Original Message-----
> From: Jun Rao <ju...@confluent.io>
> Sent: Tuesday, November 26, 2019 4:36 PM
> To: dev <de...@kafka.apache.org>
> Subject: [EXTERNAL] Re: [DISCUSS] KIP-280: Enhanced log compaction
>
> Hi, Senthil,
>
> Sorry for the delay.
>
> 51. It seems that we can just remove the last record from the batch, but keep the batch during compaction. The batch level metadata is enough to preserve the log end offset.
>
> 53. Yes, your understanding is correct. So we could recommend users to set "
> max.compaction.lag.ms" properly if they care about deletes.
>
> Could you add both to the KIP?
>
> Thanks,
>
> Jun
>
>
> On Tue, Nov 26, 2019 at 5:09 AM Senthilnathan Muthusamy <se...@microsoft.com.invalid> wrote:
>
> > Hi Gouzhang & Jun,
> >
> > Can one of you please confirm/respond to the below mail so that I 
> > will go ahead and update the KIP and proceed.
> >
> > Thanks
> > Senthil
> >
> > - Senthil
> > ________________________________
> > From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> > Sent: Wednesday, November 20, 2019 5:04:20 PM
> > To: dev@kafka.apache.org <de...@kafka.apache.org>
> > Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> >
> > <merging threads>
> >
> > Hi Gouzhang & Jun,
> >
> > Thanks for the details on the scenarios.
> >
> > #51 => thanks for the details with the example, Gouzhang. Won't the
> > followers be syncing the LEO with the leader as well? If yes, always
> > keeping the last record (without compaction, for the non-offset
> > scenarios) would work, and this is needed only if the new strategy ends
> > up removing the LEO record, right? Also I wasn't able to retrieve
> > Jason's mail related to creating an empty message... Can you please
> > forward it if you have it? Wondering how that can solve this particular
> > issue unless we create a record for a random key that won't conflict
> > with the producer/consumer keys for that topic/partition.
> >
> > #53 => I see that this can happen for topics with a low produce rate,
> > which remain ineligible for compaction for an unbounded duration,
> > whereby "delete.retention.ms" triggers and removes the tombstone
> > record. If that's the case (please correct me if I am missing any other
> > scenarios), then we can suggest that Kafka users set "segment.ms" &
> > "max.compaction.lag.ms" (as compaction won't happen on the active
> > segment) to be smaller than "delete.retention.ms", and that should
> > address this scenario, right?
> >
> > Thanks,
> > Senthil
> >
> > -----Original Message-----
> > From: Jun Rao <ju...@confluent.io>
> > Sent: Wednesday, November 13, 2019 9:31 AM
> > To: dev <de...@kafka.apache.org>
> > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> >
> > Hi, Seth,
> >
> > 51. The difference is that with the offset compaction strategy, the 
> > message corresponding to the last offset is always the winning 
> > record and will never be removed. But with the new strategies, it's 
> > now possible that the message corresponding to the last offset is a 
> > losing record and needs to be removed.
> >
> > 53. Similarly, with the offset compaction strategy, if we see a 
> > non-tombstone record after a tombstone record, the non-tombstone 
> > record is always the winning one. However, with the new strategies, 
> > that non-tombstone record with a larger offset could be a losing 
> > record. The question is then how do we retain the tombstone long 
> > enough so that we could still recognize that the non-tombstone record should be ignored.
> >
> > Thanks,
> >
> > Jun
> >
> > -----Original Message-----
> > From: Guozhang Wang <wa...@gmail.com>
> > Sent: Tuesday, November 12, 2019 6:09 PM
> > To: dev <de...@kafka.apache.org>
> > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> >
> > Hello Senthil,
> >
> > Let me try to re-iterate on Jun's comments with some context here:
> >
> > 51: today with the offset-only compaction strategy, the last record 
> > of the log (we call it the log-end-record, whose offset is
> > log-end-offset) would always be preserved and not compacted. This is 
> > kinda important for replication since followers reason about the log-end-offset on the leader.
> > Consider this case: three replicas of a partition, leader 1 and 
> > follower 2 and 3.
> >
> > Leader 1 has records a, b, c, d and d is the current last record of 
> > the partition, the current log-end-offset is 3 (assuming record a's 
> > offset is 0).
> > Follower 2 has replicated a, b, c, d. Log-end-offset is 3 Follower 3 
> > has replicated a, b, c but not yet replicated d. Log-end-offset is 2.
> >
> > NOTE that compaction triggering is independent across brokers, so it
> > is possible that leader 1 triggers compaction and deletes record d,
> > while other followers have not triggered compaction yet. At this
> > moment the leader's log becomes a, b, c. Now let's say follower 3
> > fetches from the leader after the compaction; it will no longer see record d.
> >
> > Now suppose there's a leader migration and follower 3 becomes the 
> > new leader, it would accept new appends (say, it's e), and record e 
> > would be appended at *offset 3 *on new leader 3's log. But follower 
> > 2's offset 3's record is d still. Later let's say follower 2 also 
> > triggers compaction and also fetches the new record e from new leader 3:
> >
> > Follower 2's log would be* a(0), b(1), c(2), e(4)* where the numbers 
> > in brackets are offset number; while leader 3's log would be *a(0), 
> > b(1), c(2), e(3)*. Now you see the two logs diverges in offsets, 
> > although their log entries are the same.
> >
> > -------------------------------------
> >
> > One way to resolve this, is to simply never remove the last message 
> > during compaction. Another way (suggested by Jason in the old VOTE
> > thread) is to create an empty message batch to "take up" that offset slot.
> >
> >
> > 53: Again here's some context on when we can delete a tombstone (null):
> > during compaction, if we see the latest record for a certain key is 
> > a tombstone we can remove all old records BUT that tombstone itself 
> > cannot be removed immediately since the old records may already be 
> > fetched by some consumers and that tombstone may not be fetched by 
> > consumer yet. Also that tombstone may have not been replicated to 
> > all other followers yet while the old records have already been 
> > replicated. Hence we have some config on the broker to "delay" the 
> > removal of the tombstone itself. You can find this config named 
> > "delete.retention.ms" in
> >
> > https://kafka.apache.org/documentation/#brokerconfigs
> >
> > Now consider under timestamp / header based compaction strategy: a 
> > later record may still be deprecated by an early tombstone, so if 
> > that tombstone is already removed then the log compaction thread 
> > would not remove that later record and hence the logic would be 
> > broken. That's why we also need to consider "delaying" the removal of the tombstone in this case.
> >
> > Personally I think we can still piggy-back on the "delete.retention.ms"
> > since its default value is 86400000ms == 1 day, and we just need to 
> > document that if you have timestamp / header based compaction, then 
> > it's YOUR responsibility as the Kafka user to make sure that the 
> > timestamp / header out-of-order window is smaller than the value of "delete.retention.ms".
> > Otherwise some later records with smaller timestamp / headers may 
> > not be compacted correctly since the tombstone is already gone and 
> > hence we do not have the "proof" to remove it anymore.
> >
> >
> > Does that make sense to you?
> >
> > Guozhang
> >
> >
> > On Tue, Nov 12, 2019 at 9:15 AM Senthilnathan Muthusamy < 
> > senthilm@microsoft.com.invalid> wrote:
> >
> > > Hi Jun,
> > >
> > > Thanks for the response and please find below the response!
> > >
> > > #50 - got it...
> > >
> > > #51 - not sure how the last record will be deleted because of this
> > > new compact strategy. The reason I am asking is, the compaction is 
> > > based out of offsetmap and the new strategy logic is purely within 
> > > the offsetmap... the offsetmap will always keep track of the 
> > > latest offset irrespective of the compaction strategy. You can 
> > > have a look at the PR of the new compaction strategy changes:
> > > https://github.com/apache/kafka/pull/7528/files
> > >
> > > #52 - sure, I have updated JIRA to include this details in the wiki.
> > >
> > > #53 - as I pointed out in #51, the tombstone is abstract to
> > > this change (i.e. the tombstone is handled within LogCleaner and
> > > the compact strategy is within the offsetmap). This is my
> > > understanding of the tombstone based on the code walk-thru... please
> > > let me know if I am missing anything here...
> > >
> > > Thanks,
> > > Senthil
> > >
> > > -----Original Message-----
> > > From: Jun Rao <ju...@confluent.io>
> > > Sent: Thursday, November 7, 2019 4:32 PM
> > > To: dev <de...@kafka.apache.org>
> > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > >
> > > Hi, Senthil,
> > >
> > > Thanks for bringing back this KIP. Overall, this seems like a 
> > > useful feature. A few comments below.
> > >
> > > 50. One use case for the timestamp based compaction is to resolve 
> > > conflicts during data center failures. The failover of a data 
> > > center typically takes much longer than a millisecond. So, the timestamp
> > > could be enough to determine the value to keep.
> > >
> > > 51. With the timestamp/header strategy, it seems that it may now 
> > > be possible that the last record could be removed during compaction.
> > > For example, if the active segment is empty, the last record in 
> > > the previous segment could be removed due to compaction. A new 
> > > replica then won't see the true end offset of the partition. If 
> > > that replica ever becomes the leader, it could write a different 
> > > record on the same end offset, which will be weird.
> > >
> > > 52. With the timestamp/header strategy, the behavior of the 
> > > application may need to change. In particular, the application 
> > > can't just blindly take the record with a larger offset and 
> > > assume that it's
> > the value to keep.
> > > It needs to check the timestamp or the header now. So, it would be 
> > > useful to at least document this.
> > >
> > > 53. This also adds complexity for deletes. Currently, we use a 
> > > null payload to indicate a delete tombstone. The tombstone can be 
> > > removed once all previous records with the same key have been 
> > > removed. If the new strategies apply to tombstones, it's not clear 
> > > when a tombstone can be removed since subsequent records could 
> > > have timestamp/sequenceId smaller than that in the tombstone. It 
> > > would be useful to think this through and document the expected behavior.
> > >
> > > Jun
> > >
> > > On Tue, Nov 5, 2019 at 11:37 AM Senthilnathan Muthusamy < 
> > > senthilm@microsoft.com.invalid> wrote:
> > >
> > > > Hi Guozhang,
> > > >
> > > > Sure and I have made a note in the JIRA item to make sure the 
> > > > wiki is updated.
> > > >
> > > > Thanks,
> > > > Senthil
> > > >
> > > > -----Original Message-----
> > > > From: Guozhang Wang <wa...@gmail.com>
> > > > Sent: Monday, November 4, 2019 11:00 AM
> > > > To: dev <de...@kafka.apache.org>
> > > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > > >
> > > > Hello Senthilnathan,
> > > >
> > > > Thanks for revamping on the KIP. I have only one comment about 
> > > > the wiki otherwise LGTM.
> > > >
> > > > 1. We should emphasize that the newly introduced config yields 
> > > > to the existing "log.cleanup.policy", i.e. if the latter's value 
> > > > is `delete` not `compact`, then the previous config would be ignored.
> > > >
> > > >
> > > > Guozhang
> > > >
> > > > On Mon, Nov 4, 2019 at 9:52 AM Senthilnathan Muthusamy < 
> > > > senthilm@microsoft.com.invalid> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I will start the vote thread shortly for this updated KIP. If 
> > > > > there are any more thoughts I would love to hear them.
> > > > >
> > > > > Thanks,
> > > > > Senthil
> > > > >
> > > > > -----Original Message-----
> > > > > From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> > > > > Sent: Thursday, October 31, 2019 3:51 AM
> > > > > To: dev@kafka.apache.org
> > > > > Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> > > > >
> > > > > Hi Matthias
> > > > >
> > > > > Thanks for the response.
> > > > >
> > > > > (1) Yes
> > > > >
> > > > > (2) Yes, and the config name will be the same (i.e.
> > > > > `log.cleaner.compaction.strategy` &
> > > > > `log.cleaner.compaction.strategy.header`) at broker level and
> > > > > topic level (to override the broker-level default compact strategy).
> > > > > Please let me know if we need to keep it in a different naming
> > > > > convention. Note:
> > > > > Broker-level configuration (which will be in server.properties)
> > > > > is optional and defaults to offset. Topic-level configuration
> > > > > will default to the broker-level config...
> > > > >
> > > > > (3) This new way avoids another config parameter, and also, in the
> > > > > future, if any new strategy like header needs additional
> > > > > info, no additional config is required. As this got discussed
> > > > > already and it was agreed to have a separate config, I will revert it. KIP updated...
> > > > >
> > > > > (4) Done
> > > > >
> > > > > (5) Updated
> > > > >
> > > > > (6) Updated to pick the first header in the list
> > > > >
> > > > > Please let me know if you have any other questions.
> > > > >
> > > > > Thanks,
> > > > > Senthil
> > > > >
> > > > > -----Original Message-----
> > > > > From: Matthias J. Sax <ma...@confluent.io>
> > > > > Sent: Thursday, October 31, 2019 12:13 AM
> > > > > To: dev@kafka.apache.org
> > > > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > > > >
> > > > > Thanks for picking up this KIP, Senthil.
> > > > >
> > > > > (1) As far as I remember, the main issue of the original 
> > > > > proposal was a missing topic level configuration for the compaction strategy.
> > > > > With this being addressed, I am in favor of this KIP.
> > > > >
> > > > > (2) With regard to (1), it seems we would need a new topic 
> > > > > level config `compaction.strategy`, and 
> > > > > `log.cleaner.compaction.strategy` would be the default 
> > > > > strategy (ie, broker level config) if a topic does
> > > > not overwrite it?
> > > > >
> > > > > (3) Why did you remove 
> > > > > `log.cleaner.compaction.strategy.header`
> > > > > parameter and change the accepted values of 
> > > > > `log.cleaner.compaction.strategy` to "header.<key>" instead of 
> > > > > keeping "header"? The original approach seems to be cleaner, 
> > > > > and I think this was discussed on the original discuss thread already.
> > > > >
> > > > > (4) Nit: For the "timestamp" compaction strategy you changed 
> > > > > the KIP to
> > > > >
> > > > > -> `The record [create] timestamp`
> > > > >
> > > > > This is misleading IMHO, because it depends on the
> > > > > broker/log configuration `(log.)message.timestamp.type` that 
> > > > > can either be `CreateTime` or `LogAppendTime` what the actual record timestamp is.
> > > > > I would just remove "create" to keep it unspecified.
> > > > >
> > > > > (5) Nit: the section "Public Interfaces" should list the newly 
> > > > > introduced configs -- configuration parameters are a public
> > interface.
> > > > >
> > > > > (6) What do you mean by "first level header lookup"? The term 
> > > > > "first level" indicates some hierarchy, but headers don't have 
> > > > > any hierarchy
> > > > > -- it's just a list of key-value pairs? If you mean the 
> > > > > _order_ of the headers, ie, pick the first header in the list 
> > > > > that matches the key, please rephrase it to make it clearer.
> > > > >
> > > > >
> > > > >
> > > > > @Tom: I agree with all you are saying, however, I still think 
> > > > > that this KIP will improve the overall situation, because 
> > > > > everything you pointed out is actually true with offset based compaction, too.
> > > > >
> > > > > The KIP is not a silver bullet that solves all issue for 
> > > > > interleaved writes, but I personally believe, it's a good
> > improvement.
> > > > >
> > > > >
> > > > >
> > > > > -Matthias
> > > > >
> > > > >
> > > > > On 10/30/19 9:45 AM, Senthilnathan Muthusamy wrote:
> > > > > > Hi,
> > > > > >
> > > > > > Please let me know if anyone has any questions on this 
> > > > > > updated
> > > > KIP-280...
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Senthil
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Senthilnathan Muthusamy 
> > > > > > <se...@microsoft.com.INVALID>
> > > > > > Sent: Monday, October 28, 2019 11:36 PM
> > > > > > To: dev@kafka.apache.org
> > > > > > Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> > > > > >
> > > > > > Hi Tom,
> > > > > >
> > > > > > Sorry for the delayed response.
> > > > > >
> > > > > > Regarding the fallback-to-offset decision for both the timestamp &
> > > > > > header value: it is based on the previous author's discussion at
> > > > > > https://lists.apache.org/thread.html/f44317eb6cd34f91966654c80509d4a457dbbccdd02b86645782be67@%3Cdev.kafka.apache.org%3E
> > > > > > and, as per that discussion, it is really required to avoid duplicates.
> > > > > >
> > > > > > And the timestamp strategy is from the original KIP author 
> > > > > > and we are
> > > > > keeping it as is.
> > > > > >
> > > > > > Finally, on the sequence order guarantee by the producer: it is
> > > > > > not feasible to wait for the ack in async / multi-thread/process
> > > > > > scenarios, hence the header-sequence-based compact strategy, with
> > > > > > the producer's responsibility being to generate a unique sequence
> > > > > > at the topic-partition-key level.
> > > > > >
> > > > > > Hoping this clarifies all your questions. Please let us know 
> > > > > > if you have
> > > > > any further questions.
> > > > > >
> > > > > > @Guozhang Wang / @Matthias J. Sax, I see you both had a 
> > > > > > detail
> > > > > discussion on the original KIP with previous author and it 
> > > > > would be great to hear your inputs as well.
> > > > > >
> > > > > > Thanks,
> > > > > > Senthil
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Tom Bentley <tb...@redhat.com>
> > > > > > Sent: Tuesday, October 22, 2019 2:32 AM
> > > > > > To: dev@kafka.apache.org
> > > > > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > > > > >
> > > > > > Hi Senthilnathan,
> > > > > >
> > > > > > In the motivation isn't it a little misleading to say "On 
> > > > > > the producer side, we clearly preserve an order for the two 
> > > > > > messages, <K1, V1> <K1,
> > > > > > V2>"? IMHO, the semantics of the producer are clear that 
> > > > > > having an observed
> > > > > > order of sending records from different producers is not 
> > > > > > sufficient to
> > > > > guarantee ordering on the broker. You really need to send the 
> > > > > 2nd record only after the 1st record is acked. It's the 
> > > > > difficulty of achieving that in practice that's the true motivation for your KIP.
> > > > > >
> > > > > > I can see the attraction of using timestamps, but it would 
> > > > > > be helpful to
> > > > > explain how that really solves the problem. When the producers 
> > > > > are in different processes on different machines you're 
> > > > > relying on their clocks being synchronized, which is a whole 
> > > > > subject in itself. Even if they're synchronized the resolution 
> > > > > of
> > > > > System.currentTimeMillis() is typically many milliseconds. If 
> > > > > your producers are in different threads of the same process 
> > > > > that could be a real problem because it
> > > > makes ties quite likely.
> > > > > > And you don't explain why it's OK to resolve ties using the offset.
> > > > > > The
> > > > > basis of your argument is that the offset is giving you the 
> > > > > wrong
> > > answer.
> > > > > > So it seems to me that using it as a tiebreaker is just 
> > > > > > narrowing the
> > > > > chances of getting the wrong answer. Maybe none of this 
> > > > > matters for your use case, but I think it should be spelled 
> > > > > out in the KIP, because it surely would matter for similar use cases.
> > > > > >
> > > > > > Using a sequence at least removes the problem of ties, but 
> > > > > > the
> > > > > interesting bit is now in how you deal with races between 
> > > > > threads/processes in getting a sequence number allocated 
> > > > > (which is out of scope of the KIP, I guess).
> > > > > > How is resolving that race any simpler that resolving the 
> > > > > > motivating
> > > > > race by waiting for the ack of the first record sent?
> > > > > >
> > > > > > Kind regards,
> > > > > >
> > > > > > Tom
> > > > > >
> > > > > > On Mon, Oct 21, 2019 at 9:06 PM Senthilnathan Muthusamy <
> > > > > senthilm@microsoft.com.invalid> wrote:
> > > > > >
> > > > > >> Hi All,
> > > > > >>
> > > > > >> We are bring back the KIP-280 to live with small correct 
> > > > > >> for the discussion & voting. Thanks to previous author Luis 
> > > > > >> Cabral on the
> > > > > >> KIP-280 initiation and we are taking over to complete and 
> > > > > >> get it into
> > > > > 2.4...
> > > > > >>
> > > > > >> Below is the correction that we made to the existing KIP-280:
> > > > > >>
> > > > > >>   *   Allowing the compact strategy configuration at the topic
> > level
> > > > as
> > > > > >> the log compaction is at the topic level and a broker can 
> > > > > >> have multiple topics. This allows the flexibility to have 
> > > > > >> the strategy at both broker level (i.e. for all topics 
> > > > > >> within the
> > > > > >> broker) and topic level (i.e. for a subset of topics within 
> > > > > >> a
> > broker) as well...
> > > > > >>
> > > > > >> KIP-280:
> > > > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction
> > > > > >> PULL REQUEST: https://github.com/apache/kafka/pull/7528 (unit test coverage in progress)
> > > > > >>
> > > > > >> Previous Thread DISCUSS:
> > > > > >> https://lists.apache.org/thread.html/79aa6e50d7c737ddf83455dd8063692a535a1afa558620fe1a1496d3@%3Cdev.kafka.apache.org%3E
> > > > > >> Previous Thread VOTE:
> > > > > >> https://lists.apache.org/thread.html/b2ecd73ce849741f0c40b4f801c3f7650583497812713e240e1ac2b7@%3Cdev.kafka.apache.org%3E
> > > > > >>
> > > > > >> Appreciate your timely action.
> > > > > >>
> > > > > >> PS: Initiating a separate thread as I was not able to reply 
> > > > > >> to the existing threads...
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Senthil
> > > > > >>
> > > > >
> > > > >
> > > >
> > > > --
> > > > -- Guozhang
> > > >
> > >
> >
> >
> > --
> > -- Guozhang
> >

Re: [EXTERNAL] Re: [DISCUSS] KIP-280: Enhanced log compaction

Posted by radai <ra...@gmail.com>.
May I suggest that if, under the "header" strategy, multiple records are
found with identical header values, they are ALL kept?
This would be useful in cases where users send payloads larger than the
max record size to Kafka and are forced to fragment them - by setting
the same header on all fragments it would become possible to properly
log-compact topics with such fragmented payloads.
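
A minimal Java sketch of the fragmentation idea above: every fragment of one oversized payload carries the same header value, so a compaction variant that keeps all records with identical header values would retain the complete set. The topic name, header name, and fragmenting logic are assumptions for illustration, not part of the KIP.

    import java.nio.charset.StandardCharsets;
    import java.util.List;
    import java.util.UUID;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    final class FragmentedSender {
        // Sends all fragments of one payload with an identical "payload-id" header.
        static void sendFragments(KafkaProducer<String, byte[]> producer, String key, List<byte[]> fragments) {
            byte[] payloadId = UUID.randomUUID().toString().getBytes(StandardCharsets.UTF_8);
            for (byte[] fragment : fragments) {
                ProducerRecord<String, byte[]> record = new ProducerRecord<>("fragmented-topic", key, fragment);
                record.headers().add("payload-id", payloadId);
                producer.send(record);
            }
        }
    }
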

On Tue, Nov 26, 2019 at 10:24 PM Senthilnathan Muthusamy
<se...@microsoft.com.invalid> wrote:
>
> Thanks Jun for confirming!
>
> I have updated the KIP (added recommendation section and special case in handling LEO record for non-offset based compaction strategy). Please review and let me know if you have any other feedback.
>
> Regards,
> Senthil
>
> -----Original Message-----
> From: Jun Rao <ju...@confluent.io>
> Sent: Tuesday, November 26, 2019 4:36 PM
> To: dev <de...@kafka.apache.org>
> Subject: [EXTERNAL] Re: [DISCUSS] KIP-280: Enhanced log compaction
>
> Hi, Senthil,
>
> Sorry for the delay.
>
> 51. It seems that we can just remove the last record from the batch, but keep the batch during compaction. The batch level metadata is enough to preserve the log end offset.
>
> 53. Yes, your understanding is correct. So we could recommend users to set "
> max.compaction.lag.ms" properly if they care about deletes.
>
> Could you add both to the KIP?
>
> Thanks,
>
> Jun
>
>
> On Tue, Nov 26, 2019 at 5:09 AM Senthilnathan Muthusamy <se...@microsoft.com.invalid> wrote:
>
> > Hi Gouzhang & Jun,
> >
> > Can one of you please confirm/respond to the below mail so that I will
> > go ahead and update the KIP and proceed.
> >
> > Thanks
> > Senthil
> >
> > - Senthil
> > ________________________________
> > From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> > Sent: Wednesday, November 20, 2019 5:04:20 PM
> > To: dev@kafka.apache.org <de...@kafka.apache.org>
> > Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> >
> > <merging threads>
> >
> > Hi Gouzhang & Jun,
> >
> > Thanks for the details on the scenarios.
> >
> > #51 => thanks for the details Gouzhang with example. Does followers
> > won't be sync'ing LEO as well with leader? If yes, keeping last record
> > always (without compaction for non-offset scenarios) would work and
> > this needed only if the new strategy ends up removing LEO record,
> > right? Also I couldn’t able to retrieve Jason's mail related to
> > creating an empty message... Can you please forward if you have?
> > Wondering how that can solve this particular issue unless creating
> > record for random key that won't conflict with the producer/consumer keys for that topic/partition.
> >
> > #53 => I see that this can happen when a low produce rate keeps a segment
> > ineligible for compaction for an unbounded duration, whereby
> > "delete.retention.ms" triggers and removes the tombstone record. If
> > that's the case (please correct me if I am missing any other
> > scenarios), then we can suggest that Kafka users set "segment.ms" &
> > "max.compaction.lag.ms" (as compaction won't happen on the active segment)
> > to be smaller than "delete.retention.ms", and that should address
> > this scenario, right?
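
To make that recommendation concrete, the relationship between the existing
topic-level configs might look like this (values purely illustrative):

    # roll the active segment at least hourly so records become cleanable
    segment.ms=3600000
    # force rolled segments to be compacted within a day
    max.compaction.lag.ms=86400000
    # keep tombstones for a week after compaction, well beyond the two values above
    delete.retention.ms=604800000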
> >
> > Thanks,
> > Senthil
> >
> > -----Original Message-----
> > From: Jun Rao <ju...@confluent.io>
> > Sent: Wednesday, November 13, 2019 9:31 AM
> > To: dev <de...@kafka.apache.org>
> > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> >
> > Hi, Seth,
> >
> > 51. The difference is that with the offset compaction strategy, the
> > message corresponding to the last offset is always the winning record
> > and will never be removed. But with the new strategies, it's now
> > possible that the message corresponding to the last offset is a losing
> > record and needs to be removed.
> >
> > 53. Similarly, with the offset compaction strategy, if we see a
> > non-tombstone record after a tombstone record, the non-tombstone
> > record is always the winning one. However, with the new strategies,
> > that non-tombstone record with a larger offset could be a losing
> > record. The question is then how do we retain the tombstone long
> > enough so that we could still recognize that the non-tombstone record should be ignored.
> >
> > Thanks,
> >
> > Jun
> >
> > -----Original Message-----
> > From: Guozhang Wang <wa...@gmail.com>
> > Sent: Tuesday, November 12, 2019 6:09 PM
> > To: dev <de...@kafka.apache.org>
> > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> >
> > Hello Senthil,
> >
> > Let me try to re-iterate on Jun's comments with some context here:
> >
> > 51: today with the offset-only compaction strategy, the last record of
> > the log (we call it the log-end-record, whose offset is
> > log-end-offset) would always be preserved and not compacted. This is
> > kinda important for replication since followers reason about the log-end-offset on the leader.
> > Consider this case: three replicas of a partition, leader 1 and
> > follower 2 and 3.
> >
> > Leader 1 has records a, b, c, d and d is the current last record of
> > the partition, the current log-end-offset is 3 (assuming record a's
> > offset is 0).
> > Follower 2 has replicated a, b, c, d. Log-end-offset is 3 Follower 3
> > has replicated a, b, c but not yet replicated d. Log-end-offset is 2.
> >
> > NOTE that compaction triggering is independent on each broker; it is
> > possible that leader 1 triggers compaction and deletes record d, while
> > other followers have not triggered compaction yet. At this moment the
> > leader's log becomes a, b, c. Now let's say follower 3 fetches from the
> > leader after the compaction; it will no longer see record d.
> >
> > Now suppose there's a leader migration and follower 3 becomes the new
> > leader, it would accept new appends (say, it's e), and record e would
> > be appended at *offset 3 *on new leader 3's log. But follower 2's
> > offset 3's record is d still. Later let's say follower 2 also triggers
> > compaction and also fetches the new record e from new leader 3:
> >
> > Follower 2's log would be* a(0), b(1), c(2), e(4)* where the numbers
> > in brackets are offset number; while leader 3's log would be *a(0),
> > b(1), c(2), e(3)*. Now you see the two logs diverge in offsets,
> > although their log entries are the same.
> >
> > -------------------------------------
> >
> > One way to resolve this is to simply never remove the last message
> > during compaction. Another way (suggested by Jason in the old VOTE
> > thread) is to create an empty message batch to "take up" that offset slot.
> >
> >
> > 53: Again here's some context on when we can delete a tombstone (null):
> > during compaction, if we see the latest record for a certain key is a
> > tombstone we can remove all old records BUT that tombstone itself
> > cannot be removed immediately since the old records may already be
> > fetched by some consumers and that tombstone may not be fetched by
> > consumer yet. Also that tombstone may have not been replicated to all
> > other followers yet while the old records have already been
> > replicated. Hence we have some config on the broker to "delay" the
> > removal of the tombstone itself. You can find this config named
> > "delete.retention.ms" in
> >
> > https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fkafk
> > a.apache.org%2Fdocumentation%2F%23brokerconfigs&amp;data=02%7C01%7Csen
> > thilm%40microsoft.com%7C38cdb83b7c5e4ce6f94c08d772d1c32f%7C72f988bf86f
> > 141af91ab2d7cd011db47%7C1%7C0%7C637104117564048775&amp;sdata=g3UzPsobM
> > CTS7bqJKvuI7VkFQynhIcY7fOsT3%2FlJ5lg%3D&amp;reserved=0
> >
> > Now consider the timestamp / header based compaction strategy: a
> > later record may still be deprecated by an early tombstone, so if that
> > tombstone is already removed then the log compaction thread would not
> > remove that later record and hence the logic would be broken. That's
> > why we also need to consider "delaying" the removal of the tombstone in this case.
> >
> > Personally I think we can still piggy-back on the "delete.retention.ms"
> > since its default value is 86400000ms == 1 day, and we just need to
> > document that if you have timestamp / header based compaction, then
> > it's YOUR responsibility as the Kafka user to make sure that the
> > timestamp / header out of ordering is smaller than the value of "delete.retention.ms".
> > Otherwise some later records with smaller timestamp / headers may not
> > be compacted correctly since the tombstone is already gone and hence
> > we do not have the "proof" to remove it anymore.
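
A minimal sketch of the failure mode being described here (illustrative logic
only, not the actual LogCleaner code):

    // Under a version-based (timestamp/header) strategy a record is discarded
    // only if a retained record for its key carries a higher version. If the
    // tombstone holding that higher version was already removed (after
    // delete.retention.ms), retainedVersionForKey is null and the stale record
    // survives compaction - the broken case described above.
    static boolean discardDuringCompaction(long recordVersion, Long retainedVersionForKey) {
        return retainedVersionForKey != null && recordVersion < retainedVersionForKey;
    }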
> >
> >
> > Does that make sense to you?
> >
> > Guozhang
> >
> >
> > On Tue, Nov 12, 2019 at 9:15 AM Senthilnathan Muthusamy <
> > senthilm@microsoft.com.invalid> wrote:
> >
> > > Hi Jun,
> > >
> > > Thanks for the response and please find below the response!
> > >
> > > #50 - got it...
> > >
> > > #51 - not sure how the last record will be deleted because of this new
> > > compact strategy. The reason I am asking is, the compaction is based
> > > on the offsetmap and the new strategy logic is purely within the
> > > offsetmap... the offsetmap will always keep track of the latest
> > > offset irrespective of the compaction strategy. You can have a look
> > > at the PR of the new compaction strategy changes:
> > > https://github.com/apache/kafka/pull/7528/files
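
For intuition only, a toy sketch of the per-key bookkeeping a non-offset
strategy needs in the cleaner's map (this is not the actual OffsetMap code in
the pull request):

    import java.nio.ByteBuffer;
    import java.util.HashMap;
    import java.util.Map;

    // Per key, remember the "winning" version (record timestamp or a header
    // value) together with its offset; ties fall back to the larger offset.
    class VersionedOffsetMap {
        private static final class Entry { long version; long offset; }
        private final Map<ByteBuffer, Entry> map = new HashMap<>();

        void put(ByteBuffer key, long version, long offset) {
            Entry cur = map.get(key);
            if (cur == null || version > cur.version
                    || (version == cur.version && offset > cur.offset)) {
                Entry e = new Entry();
                e.version = version;
                e.offset = offset;
                map.put(key, e);
            }
        }

        // A record survives compaction only if it is the retained entry for its key.
        boolean shouldRetain(ByteBuffer key, long offset) {
            Entry cur = map.get(key);
            return cur != null && cur.offset == offset;
        }
    }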
> > >
> > > #52 - sure, I have updated JIRA to include these details in the wiki.
> > >
> > > #53 - as I pointed out in #51, the tombstone is abstracted away from this
> > > change (i.e. the tombstone is handled within LogCleaner and the
> > > compact strategy is applied by the offsetmap). This is my understanding of
> > > the tombstone handling based on the code walk-through... please let me know if I
> > > am
> > missing anything here...
> > >
> > > Thanks,
> > > Senthil
> > >
> > > -----Original Message-----
> > > From: Jun Rao <ju...@confluent.io>
> > > Sent: Thursday, November 7, 2019 4:32 PM
> > > To: dev <de...@kafka.apache.org>
> > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > >
> > > Hi, Senthil,
> > >
> > > Thanks for bringing back this KIP. Overall, this seems like a useful
> > > feature. A few comments below.
> > >
> > > 50. One use case for the timestamp based compaction is to resolve
> > > conflicts during data center failures. The failover of a data center
> > > typically takes much longer than a millisecond. So, the timestamp could be
> > > enough to determine the value to keep.
> > >
> > > 51. With the timestamp/header strategy, it seems that it may now be
> > > possible that the last record could be removed during compaction.
> > > For example, if the active segment is empty, the last record in the
> > > previous segment could be removed due to compaction. A new replica
> > > then won't see the true end offset of the partition. If that replica
> > > ever becomes the leader, it could write a different record on the
> > > same end offset, which will be weird.
> > >
> > > 52. With the timestamp/header strategy, the behavior of the
> > > application may need to change. In particular, the application can't
> > > just blindly take the record with a larger offset and assume that
> > > it's
> > the value to keep.
> > > It needs to check the timestamp or the header now. So, it would be
> > > useful to at least document this.
> > >
> > > 53. This also adds complexity for deletes. Currently, we use a null
> > > payload to indicate a delete tombstone. The tombstone can be removed
> > > once all previous records with the same key have been removed. If
> > > the new strategies apply to tombstones, it's not clear when a
> > > tombstone can be removed since subsequent records could have
> > > timestamp/sequenceId smaller than that in the tombstone. It would be
> > > useful to think this through and document the expected behavior.
> > >
> > > Jun
> > >
> > > On Tue, Nov 5, 2019 at 11:37 AM Senthilnathan Muthusamy <
> > > senthilm@microsoft.com.invalid> wrote:
> > >
> > > > Hi Guozhang,
> > > >
> > > > Sure and I have made a note in the JIRA item to make sure the wiki
> > > > is updated.
> > > >
> > > > Thanks,
> > > > Senthil
> > > >
> > > > -----Original Message-----
> > > > From: Guozhang Wang <wa...@gmail.com>
> > > > Sent: Monday, November 4, 2019 11:00 AM
> > > > To: dev <de...@kafka.apache.org>
> > > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > > >
> > > > Hello Senthilnathan,
> > > >
> > > > Thanks for revamping on the KIP. I have only one comment about the
> > > > wiki otherwise LGTM.
> > > >
> > > > 1. We should emphasize that the newly introduced config yields to
> > > > the existing "log.cleanup.policy", i.e. if the latter's value is
> > > > `delete` not `compact`, then the previous config would be ignored.
> > > >
> > > >
> > > > Guozhang
> > > >
> > > > On Mon, Nov 4, 2019 at 9:52 AM Senthilnathan Muthusamy <
> > > > senthilm@microsoft.com.invalid> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I will start the vote thread shortly for this updated KIP. If
> > > > > there are any more thoughts I would love to hear them.
> > > > >
> > > > > Thanks,
> > > > > Senthil
> > > > >
> > > > > -----Original Message-----
> > > > > From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> > > > > Sent: Thursday, October 31, 2019 3:51 AM
> > > > > To: dev@kafka.apache.org
> > > > > Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> > > > >
> > > > > Hi Matthias
> > > > >
> > > > > Thanks for the response.
> > > > >
> > > > > (1) Yes
> > > > >
> > > > > (2) Yes, and the config name will be the same (i.e.
> > > > > `log.cleaner.compaction.strategy` &
> > > > > `log.cleaner.compaction.strategy.header`) at broker level and
> > > > > topic level (to override broker level default compact strategy).
> > > > > Please let me know if we need to keep it in different naming
> > convention. Note:
> > > > > Broker level (which will be in the server.properties)
> > > > > configuration is optional and defaults to offset. Topic level
> > > > > configuration will default to the broker level config...
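
For illustration, the broker-level default plus per-topic override being
described might be expressed as follows (the names follow the KIP proposal, the
topic-level naming is still under discussion in this thread, and the header key
"version" is invented for the example - none of these settings exist in
released Kafka):

    # server.properties - broker-wide default, optional, falls back to "offset"
    log.cleaner.compaction.strategy=timestamp

    # per-topic override for a topic that should compact on a header value
    log.cleaner.compaction.strategy=header
    log.cleaner.compaction.strategy.header=version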
> > > > >
> > > > > (3) By this new way, it avoids another config parameter, and also
> > > > > in future if any new strategy like header needs additional info,
> > > > > no additional config is required. As this got discussed already and
> > > > > it was agreed to have a separate config, I will revert it. KIP updated...
> > > > >
> > > > > (4) Done
> > > > >
> > > > > (5) Updated
> > > > >
> > > > > (6) Updated to pick the first header in the list
> > > > >
> > > > > Please let me know if you have any other questions.
> > > > >
> > > > > Thanks,
> > > > > Senthil
> > > > >
> > > > > -----Original Message-----
> > > > > From: Matthias J. Sax <ma...@confluent.io>
> > > > > Sent: Thursday, October 31, 2019 12:13 AM
> > > > > To: dev@kafka.apache.org
> > > > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > > > >
> > > > > Thanks for picking up this KIP, Senthil.
> > > > >
> > > > > (1) As far as I remember, the main issue of the original
> > > > > proposal was a missing topic level configuration for the compaction strategy.
> > > > > With this being addressed, I am in favor of this KIP.
> > > > >
> > > > > (2) With regard to (1), it seems we would need a new topic level
> > > > > config `compaction.strategy`, and
> > > > > `log.cleaner.compaction.strategy` would be the default strategy
> > > > > (ie, broker level config) if a topic does
> > > > not overwrite it?
> > > > >
> > > > > (3) Why did you remove `log.cleaner.compaction.strategy.header`
> > > > > parameter and change the accepted values of
> > > > > `log.cleaner.compaction.strategy` to "header.<key>" instead of
> > > > > keeping "header"? The original approach seems to be cleaner, and
> > > > > I think this was discussed on the original discuss thread already.
> > > > >
> > > > > (4) Nit: For the "timestamp" compaction strategy you changed the
> > > > > KIP to
> > > > >
> > > > > -> `The record [create] timestamp`
> > > > >
> > > > > This is misleading IMHO, because it depends on the broker/log
> > > > > configuration `(log.)message.timestamp.type` that can either be
> > > > > `CreateTime` or `LogAppendTime` what the actual record timestamp is.
> > > > > I would just remove "create" to keep it unspecified.
> > > > >
> > > > > (5) Nit: the section "Public Interfaces" should list the newly
> > > > > introduced configs -- configuration parameters are a public
> > interface.
> > > > >
> > > > > (6) What do you mean by "first level header lookup"? The term
> > > > > "first level" indicates some hierarchy, but headers don't have
> > > > > any hierarchy
> > > > > -- it's just a list of key-value pairs? If you mean the _order_
> > > > > of the headers, ie, pick the first header in the list that
> > > > > matches the key, please rephrase it to make it clearer.
> > > > >
> > > > >
> > > > >
> > > > > @Tom: I agree with all you are saying, however, I still think
> > > > > that this KIP will improve the overall situation, because
> > > > > everything you pointed out is actually true with offset based compaction, too.
> > > > >
> > > > > The KIP is not a silver bullet that solves all issue for
> > > > > interleaved writes, but I personally believe, it's a good
> > improvement.
> > > > >
> > > > >
> > > > >
> > > > > -Matthias
> > > > >
> > > > >
> > > > > On 10/30/19 9:45 AM, Senthilnathan Muthusamy wrote:
> > > > > > Hi,
> > > > > >
> > > > > > Please let me know if anyone has any questions on this updated
> > > > KIP-280...
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Senthil
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> > > > > > Sent: Monday, October 28, 2019 11:36 PM
> > > > > > To: dev@kafka.apache.org
> > > > > > Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> > > > > >
> > > > > > Hi Tom,
> > > > > >
> > > > > > Sorry for the delayed response.
> > > > > >
> > > > > > Regarding the fall back to offset decision for both timestamp
> > > > > > & header
> > > > > value, it is based on the previous author's discussion at
> > > > > https://lists.apache.org/thread.html/f44317eb6cd34f91966654c80509d4a457dbbccdd02b86645782be67@%3Cdev.kafka.apache.org%3E
> > > > > and as per the discussion, it is really required to avoid duplicates.
> > > > > >
> > > > > > And the timestamp strategy is from the original KIP author and
> > > > > > we are
> > > > > keeping it as is.
> > > > > >
> > > > > > Finally, on the sequence order guarantee by the producer: it is
> > > > > > not
> > > > > feasible to wait for the ack in async / multi-thread/process
> > > > > scenarios, hence the header sequence based compact strategy, with the
> > > > > producer being responsible for generating a unique sequence at the
> > > > > topic-partition-key level.
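
A rough sketch of that producer-side responsibility (the header name "sequence"
is illustrative, not mandated by the KIP):

    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.header.internals.RecordHeader;

    import java.nio.ByteBuffer;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;

    // The producer owns sequence generation per key, so a header-based
    // compaction strategy keeps the record with the highest sequence even when
    // interleaved async sends reach the broker out of order.
    class SequencedSender {
        private final Map<String, AtomicLong> sequences = new ConcurrentHashMap<>();

        void send(Producer<String, String> producer, String topic, String key, String value) {
            long seq = sequences.computeIfAbsent(key, k -> new AtomicLong()).incrementAndGet();
            ProducerRecord<String, String> record = new ProducerRecord<>(topic, key, value);
            record.headers().add(new RecordHeader("sequence",
                    ByteBuffer.allocate(8).putLong(seq).array()));
            producer.send(record);  // no need to await the ack before the next send for this key
        }
    }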
> > > > > >
> > > > > > Hoping this clarifies all your questions. Please let us know
> > > > > > if you have
> > > > > any further questions.
> > > > > >
> > > > > > @Guozhang Wang / @Matthias J. Sax, I see you both had a detailed
> > > > > discussion on the original KIP with the previous author and it would be
> > > > > great to hear your inputs as well.
> > > > > >
> > > > > > Thanks,
> > > > > > Senthil
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Tom Bentley <tb...@redhat.com>
> > > > > > Sent: Tuesday, October 22, 2019 2:32 AM
> > > > > > To: dev@kafka.apache.org
> > > > > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > > > > >
> > > > > > Hi Senthilnathan,
> > > > > >
> > > > > > In the motivation isn't it a little misleading to say "On the
> > > > > > producer side, we clearly preserve an order for the two
> > > > > > messages, <K1, V1> <K1,
> > > > > > V2>"? IMHO, the semantics of the producer are clear that
> > > > > > having an observed
> > > > > > order of sending records from different producers is not
> > > > > > sufficient to
> > > > > guarantee ordering on the broker. You really need to send the
> > > > > 2nd record only after the 1st record is acked. It's the
> > > > > difficulty of achieving that in practice that's the true motivation for your KIP.
> > > > > >
> > > > > > I can see the attraction of using timestamps, but it would be
> > > > > > helpful to
> > > > > explain how that really solves the problem. When the producers
> > > > > are in different processes on different machines you're relying
> > > > > on their clocks being synchronized, which is a whole subject in
> > > > > itself. Even if they're synchronized the resolution of
> > > > > System.currentTimeMillis() is typically many milliseconds. If
> > > > > your producers are in different threads of the same process that
> > > > > could be a real problem because it
> > > > makes ties quite likely.
> > > > > > And you don't explain why it's OK to resolve ties using the offset.
> > > > > > The
> > > > > basis of your argument is that the offset is giving you the
> > > > > wrong
> > > answer.
> > > > > > So it seems to me that using it as a tiebreaker is just
> > > > > > narrowing the
> > > > > chances of getting the wrong answer. Maybe none of this matters
> > > > > for your use case, but I think it should be spelled out in the
> > > > > KIP, because it surely would matter for similar use cases.
> > > > > >
> > > > > > Using a sequence at least removes the problem of ties, but the
> > > > > interesting bit is now in how you deal with races between
> > > > > threads/processes in getting a sequence number allocated (which
> > > > > is out of scope of the KIP, I guess).
> > > > > > How is resolving that race any simpler than resolving the
> > > > > > motivating
> > > > > race by waiting for the ack of the first record sent?
> > > > > >
> > > > > > Kind regards,
> > > > > >
> > > > > > Tom
> > > > > >
> > > > > > On Mon, Oct 21, 2019 at 9:06 PM Senthilnathan Muthusamy <
> > > > > senthilm@microsoft.com.invalid> wrote:
> > > > > >
> > > > > >> Hi All,
> > > > > >>
> > > > > >> We are bring back the KIP-280 to live with small correct for
> > > > > >> the discussion & voting. Thanks to previous author Luis
> > > > > >> Cabral on the
> > > > > >> KIP-280 initiation and we are taking over to complete and get
> > > > > >> it into
> > > > > 2.4...
> > > > > >>
> > > > > >> Below is the correction that we made to the existing KIP-280:
> > > > > >>
> > > > > >>   *   Allowing the compact strategy configuration at the topic
> > level
> > > > as
> > > > > >> the log compaction is at the topic level and a broker can
> > > > > >> have multiple topics. This allows the flexibility to have the
> > > > > >> strategy at both broker level (i.e. for all topics within the
> > > > > >> broker) and topic level (i.e. for a subset of topics within a
> > broker) as well...
> > > > > >>
> > > > > >> KIP-280:
> > > > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction
> > > > > >> PULL REQUEST: https://github.com/apache/kafka/pull/7528 (unit test
> > > > > >> coverage in progress)
> > > > > >>
> > > > > >> Previous Thread DISCUSS:
> > > > > >> https://lists.apache.org/thread.html/79aa6e50d7c737ddf83455dd8063692a535a1afa558620fe1a1496d3@%3Cdev.kafka.apache.org%3E
> > > > > >> Previous Thread VOTE:
> > > > > >> https://lists.apache.org/thread.html/b2ecd73ce849741f0c40b4f801c3f7650583497812713e240e1ac2b7@%3Cdev.kafka.apache.org%3E
> > > > > >>
> > > > > >> Appreciate your timely action.
> > > > > >>
> > > > > >> PS: Initiating a separate thread as I was not able to reply
> > > > > >> to the existing threads...
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Senthil
> > > > > >>
> > > > >
> > > > >
> > > >
> > > > --
> > > > -- Guozhang
> > > >
> > >
> >
> >
> > --
> > -- Guozhang
> >

RE: [EXTERNAL] Re: [DISCUSS] KIP-280: Enhanced log compaction

Posted by Senthilnathan Muthusamy <se...@microsoft.com.INVALID>.
Thanks Jun for confirming!

I have updated the KIP (added recommendation section and special case in handling LEO record for non-offset based compaction strategy). Please review and let me know if you have any other feedback.

Regards,
Senthil

-----Original Message-----
From: Jun Rao <ju...@confluent.io> 
Sent: Tuesday, November 26, 2019 4:36 PM
To: dev <de...@kafka.apache.org>
Subject: [EXTERNAL] Re: [DISCUSS] KIP-280: Enhanced log compaction

Hi, Senthil,

Sorry for the delay.

51. It seems that we can just remove the last record from the batch, but keep the batch during compaction. The batch level metadata is enough to preserve the log end offset.

53. Yes, your understanding is correct. So we could recommend users to set "
max.compaction.lag.ms" properly if they care about deletes.

Could you add both to the KIP?

Thanks,

Jun


On Tue, Nov 26, 2019 at 5:09 AM Senthilnathan Muthusamy <se...@microsoft.com.invalid> wrote:

> Hi Gouzhang & Jun,
>
> Can one of you please confirm/respond to the below mail so that I will 
> go ahead and update the KIP and proceed.
>
> Thanks
> Senthil
>
> - Senthil
> ________________________________
> From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> Sent: Wednesday, November 20, 2019 5:04:20 PM
> To: dev@kafka.apache.org <de...@kafka.apache.org>
> Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
>
> <merging threads>
>
> Hi Gouzhang & Jun,
>
> Thanks for the details on the scenarios.
>
> #51 => thanks for the details, Gouzhang, and the example. Won't the followers
> be sync'ing the LEO with the leader as well? If yes, always keeping the last
> record (without compaction, for the non-offset strategies) would work, and
> this is needed only if the new strategy ends up removing the LEO record,
> right? Also, I wasn't able to retrieve Jason's mail related to creating an
> empty message... Can you please forward it if you have it? I'm wondering how
> that can solve this particular issue unless we create a record for a random
> key that won't conflict with the producer/consumer keys for that topic/partition.
>
> #53 => I see that this can happen when a low produce rate keeps a segment
> ineligible for compaction for an unbounded duration, whereby
> "delete.retention.ms" triggers and removes the tombstone record. If
> that's the case (please correct me if I am missing any other
> scenarios), then we can suggest that Kafka users set "segment.ms" &
> "max.compaction.lag.ms" (as compaction won't happen on the active segment)
> to be smaller than "delete.retention.ms", and that should address
> this scenario, right?
>
> Thanks,
> Senthil
>
> -----Original Message-----
> From: Jun Rao <ju...@confluent.io>
> Sent: Wednesday, November 13, 2019 9:31 AM
> To: dev <de...@kafka.apache.org>
> Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
>
> Hi, Seth,
>
> 51. The difference is that with the offset compaction strategy, the 
> message corresponding to the last offset is always the winning record 
> and will never be removed. But with the new strategies, it's now 
> possible that the message corresponding to the last offset is a losing 
> record and needs to be removed.
>
> 53. Similarly, with the offset compaction strategy, if we see a 
> non-tombstone record after a tombstone record, the non-tombstone 
> record is always the winning one. However, with the new strategies, 
> that non-tombstone record with a larger offset could be a losing 
> record. The question is then how do we retain the tombstone long 
> enough so that we could still recognize that the non-tombstone record should be ignored.
>
> Thanks,
>
> Jun
>
> -----Original Message-----
> From: Guozhang Wang <wa...@gmail.com>
> Sent: Tuesday, November 12, 2019 6:09 PM
> To: dev <de...@kafka.apache.org>
> Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
>
> Hello Senthil,
>
> Let me try to re-iterate on Jun's comments with some context here:
>
> 51: today with the offset-only compaction strategy, the last record of 
> the log (we call it the log-end-record, whose offset is 
> log-end-offset) would always be preserved and not compacted. This is 
> kinda important for replication since followers reason about the log-end-offset on the leader.
> Consider this case: three replicas of a partition, leader 1 and 
> follower 2 and 3.
>
> Leader 1 has records a, b, c, d and d is the current last record of 
> the partition, the current log-end-offset is 3 (assuming record a's 
> offset is 0).
> Follower 2 has replicated a, b, c, d. Log-end-offset is 3 Follower 3 
> has replicated a, b, c but not yet replicated d. Log-end-offset is 2.
>
> NOTE that compaction triggering is independent on each broker; it is
> possible that leader 1 triggers compaction and deletes record d, while
> other followers have not triggered compaction yet. At this moment the
> leader's log becomes a, b, c. Now let's say follower 3 fetches from the
> leader after the compaction; it will no longer see record d.
>
> Now suppose there's a leader migration and follower 3 becomes the new 
> leader, it would accept new appends (say, it's e), and record e would 
> be appended at *offset 3 *on new leader 3's log. But follower 2's 
> offset 3's record is d still. Later let's say follower 2 also triggers 
> compaction and also fetches the new record e from new leader 3:
>
> Follower 2's log would be* a(0), b(1), c(2), e(4)* where the numbers 
> in brackets are offset number; while leader 3's log would be *a(0), 
> b(1), c(2), e(3)*. Now you see the two logs diverge in offsets,
> although their log entries are the same.
>
> -------------------------------------
>
> One way to resolve this is to simply never remove the last message
> during compaction. Another way (suggested by Jason in the old VOTE 
> thread) is to create an empty message batch to "take up" that offset slot.
>
>
> 53: Again here's some context on when we can delete a tombstone (null):
> during compaction, if we see the latest record for a certain key is a 
> tombstone we can remove all old records BUT that tombstone itself 
> cannot be removed immediately since the old records may already be 
> fetched by some consumers and that tombstone may not be fetched by 
> consumer yet. Also that tombstone may have not been replicated to all 
> other followers yet while the old records have already been 
> replicated. Hence we have some config on the broker to "delay" the 
> removal of the tombstone itself. You can find this config named 
> "delete.retention.ms" in
>
> https://kafka.apache.org/documentation/#brokerconfigs
>
> Now consider the timestamp / header based compaction strategy: a
> later record may still be deprecated by an early tombstone, so if that 
> tombstone is already removed then the log compaction thread would not 
> remove that later record and hence the logic would be broken. That's 
> why we also need to consider "delaying" the removal of the tombstone in this case.
>
> Personally I think we can still piggy-back on the "delete.retention.ms"
> since its default value is 86400000ms == 1 day, and we just need to 
> document that if you have timestamp / header based compaction, then 
> it's YOUR responsibility as the Kafka user to make sure that the 
> timestamp / header out of ordering is smaller than the value of "delete.retention.ms".
> Otherwise some later records with smaller timestamp / headers may not 
> be compacted correctly since the tombstone is already gone and hence 
> we do not have the "proof" to remove it anymore.
>
>
> Does that make sense to you?
>
> Guozhang
>
>
> On Tue, Nov 12, 2019 at 9:15 AM Senthilnathan Muthusamy < 
> senthilm@microsoft.com.invalid> wrote:
>
> > Hi Jun,
> >
> > Thanks for the response and please find below the response!
> >
> > #50 - got it...
> >
> > #51 - not sure how the last record will be deleted because of this new
> > compact strategy. The reason I am asking is, the compaction is based
> > on the offsetmap and the new strategy logic is purely within the
> > offsetmap... the offsetmap will always keep track of the latest
> > offset irrespective of the compaction strategy. You can have a look
> > at the PR of the new compaction strategy changes:
> > https://github.com/apache/kafka/pull/7528/files
> >
> > #52 - sure, I have updated JIRA to include these details in the wiki.
> >
> > #53 - as I pointed out in #51, the tombstone is abstracted away from this
> > change (i.e. the tombstone is handled within LogCleaner and the
> > compact strategy is applied by the offsetmap). This is my understanding of
> > the tombstone handling based on the code walk-through... please let me know if I
> > am
> missing anything here...
> >
> > Thanks,
> > Senthil
> >
> > -----Original Message-----
> > From: Jun Rao <ju...@confluent.io>
> > Sent: Thursday, November 7, 2019 4:32 PM
> > To: dev <de...@kafka.apache.org>
> > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> >
> > Hi, Senthil,
> >
> > Thanks for bringing back this KIP. Overall, this seems like a useful 
> > feature. A few comments below.
> >
> > 50. One use case for the timestamp based compaction is to resolve 
> > conflicts during data center failures. The failover of a data center 
> > typically takes much longer than a millisecond. So, the timestamp could be
> > enough to determine the value to keep.
> >
> > 51. With the timestamp/header strategy, it seems that it may now be 
> > possible that the last record could be removed during compaction. 
> > For example, if the active segment is empty, the last record in the 
> > previous segment could be removed due to compaction. A new replica 
> > then won't see the true end offset of the partition. If that replica 
> > ever becomes the leader, it could write a different record on the 
> > same end offset, which will be weird.
> >
> > 52. With the timestamp/header strategy, the behavior of the 
> > application may need to change. In particular, the application can't 
> > just blindly take the record with a larger offset and assume that
> > it's
> the value to keep.
> > It needs to check the timestamp or the header now. So, it would be 
> > useful to at least document this.
> >
> > 53. This also adds complexity for deletes. Currently, we use a null 
> > payload to indicate a delete tombstone. The tombstone can be removed 
> > once all previous records with the same key have been removed. If 
> > the new strategies apply to tombstones, it's not clear when a 
> > tombstone can be removed since subsequent records could have 
> > timestamp/sequenceId smaller than that in the tombstone. It would be 
> > useful to think this through and document the expected behavior.
> >
> > Jun
> >
> > On Tue, Nov 5, 2019 at 11:37 AM Senthilnathan Muthusamy < 
> > senthilm@microsoft.com.invalid> wrote:
> >
> > > Hi Guozhang,
> > >
> > > Sure and I have made a note in the JIRA item to make sure the wiki 
> > > is updated.
> > >
> > > Thanks,
> > > Senthil
> > >
> > > -----Original Message-----
> > > From: Guozhang Wang <wa...@gmail.com>
> > > Sent: Monday, November 4, 2019 11:00 AM
> > > To: dev <de...@kafka.apache.org>
> > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > >
> > > Hello Senthilnathan,
> > >
> > > Thanks for revamping on the KIP. I have only one comment about the 
> > > wiki otherwise LGTM.
> > >
> > > 1. We should emphasize that the newly introduced config yields to 
> > > the existing "log.cleanup.policy", i.e. if the latter's value is 
> > > `delete` not `compact`, then the previous config would be ignored.
> > >
> > >
> > > Guozhang
> > >
> > > On Mon, Nov 4, 2019 at 9:52 AM Senthilnathan Muthusamy < 
> > > senthilm@microsoft.com.invalid> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I will start the vote thread shortly for this updated KIP. If 
> > > > there are any more thoughts I would love to hear them.
> > > >
> > > > Thanks,
> > > > Senthil
> > > >
> > > > -----Original Message-----
> > > > From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> > > > Sent: Thursday, October 31, 2019 3:51 AM
> > > > To: dev@kafka.apache.org
> > > > Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> > > >
> > > > Hi Matthias
> > > >
> > > > Thanks for the response.
> > > >
> > > > (1) Yes
> > > >
> > > > (2) Yes, and the config name will be the same (i.e.
> > > > `log.cleaner.compaction.strategy` &
> > > > `log.cleaner.compaction.strategy.header`) at broker level and 
> > > > topic level (to override broker level default compact strategy).
> > > > Please let me know if we need to keep it in different naming
> convention. Note:
> > > > Broker level (which will be in the server.properties) 
> > > > configuration is optional and defaults to offset. Topic level
> > > > configuration will default to the broker level config...
> > > >
> > > > (3) By this new way, it avoids another config parameter, and also
> > > > in future if any new strategy like header needs additional info,
> > > > no additional config is required. As this got discussed already and
> > > > it was agreed to have a separate config, I will revert it. KIP updated...
> > > >
> > > > (4) Done
> > > >
> > > > (5) Updated
> > > >
> > > > (6) Updated to pick the first header in the list
> > > >
> > > > Please let me know if you have any other questions.
> > > >
> > > > Thanks,
> > > > Senthil
> > > >
> > > > -----Original Message-----
> > > > From: Matthias J. Sax <ma...@confluent.io>
> > > > Sent: Thursday, October 31, 2019 12:13 AM
> > > > To: dev@kafka.apache.org
> > > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > > >
> > > > Thanks for picking up this KIP, Senthil.
> > > >
> > > > (1) As far as I remember, the main issue of the original 
> > > > proposal was a missing topic level configuration for the compaction strategy.
> > > > With this being addressed, I am in favor of this KIP.
> > > >
> > > > (2) With regard to (1), it seems we would need a new topic level 
> > > > config `compaction.strategy`, and 
> > > > `log.cleaner.compaction.strategy` would be the default strategy 
> > > > (ie, broker level config) if a topic does
> > > not overwrite it?
> > > >
> > > > (3) Why did you remove `log.cleaner.compaction.strategy.header`
> > > > parameter and change the accepted values of 
> > > > `log.cleaner.compaction.strategy` to "header.<key>" instead of 
> > > > keeping "header"? The original approach seems to be cleaner, and 
> > > > I think this was discussed on the original discuss thread already.
> > > >
> > > > (4) Nit: For the "timestamp" compaction strategy you changed the 
> > > > KIP to
> > > >
> > > > -> `The record [create] timestamp`
> > > >
> > > > This is misleading IMHO, because it depends on the broker/log
> > > > configuration `(log.)message.timestamp.type` that can either be 
> > > > `CreateTime` or `LogAppendTime` what the actual record timestamp is.
> > > > I would just remove "create" to keep it unspecified.
> > > >
> > > > (5) Nit: the section "Public Interfaces" should list the newly 
> > > > introduced configs -- configuration parameters are a public
> interface.
> > > >
> > > > (6) What do you mean by "first level header lookup"? The term 
> > > > "first level" indicates some hierarchy, but headers don't have 
> > > > any hierarchy
> > > > -- it's just a list of key-value pairs? If you mean the _order_ 
> > > > of the headers, ie, pick the first header in the list that 
> > > > matches the key, please rephrase it to make it clearer.
> > > >
> > > >
> > > >
> > > > @Tom: I agree with all you are saying, however, I still think 
> > > > that this KIP will improve the overall situation, because 
> > > > everything you pointed out is actually true with offset based compaction, too.
> > > >
> > > > The KIP is not a silver bullet that solves all issue for 
> > > > interleaved writes, but I personally believe, it's a good
> improvement.
> > > >
> > > >
> > > >
> > > > -Matthias
> > > >
> > > >
> > > > On 10/30/19 9:45 AM, Senthilnathan Muthusamy wrote:
> > > > > Hi,
> > > > >
> > > > > Please let me know if anyone has any questions on this updated
> > > KIP-280...
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Senthil
> > > > >
> > > > > -----Original Message-----
> > > > > From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> > > > > Sent: Monday, October 28, 2019 11:36 PM
> > > > > To: dev@kafka.apache.org
> > > > > Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> > > > >
> > > > > Hi Tom,
> > > > >
> > > > > Sorry for the delayed response.
> > > > >
> > > > > Regarding the fall back to offset decision for both timestamp
> > > > > & header
> > > > value, it is based on the previous author's discussion at
> > > > https://lists.apache.org/thread.html/f44317eb6cd34f91966654c80509d4a457dbbccdd02b86645782be67@%3Cdev.kafka.apache.org%3E
> > > > and as per the discussion, it is really required to avoid duplicates.
> > > > >
> > > > > And the timestamp strategy is from the original KIP author and 
> > > > > we are
> > > > keeping it as is.
> > > > >
> > > > > Finally, on the sequence order guarantee by the producer: it is
> > > > > not
> > > > feasible to wait for the ack in async / multi-thread/process
> > > > scenarios, hence the header sequence based compact strategy, with the
> > > > producer being responsible for generating a unique sequence at the
> > > > topic-partition-key level.
> > > > >
> > > > > Hoping this clarifies all your questions. Please let us know 
> > > > > if you have
> > > > any further questions.
> > > > >
> > > > > @Guozhang Wang / @Matthias J. Sax, I see you both had a detailed
> > > > discussion on the original KIP with the previous author and it would be
> > > > great to hear your inputs as well.
> > > > >
> > > > > Thanks,
> > > > > Senthil
> > > > >
> > > > > -----Original Message-----
> > > > > From: Tom Bentley <tb...@redhat.com>
> > > > > Sent: Tuesday, October 22, 2019 2:32 AM
> > > > > To: dev@kafka.apache.org
> > > > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > > > >
> > > > > Hi Senthilnathan,
> > > > >
> > > > > In the motivation isn't it a little misleading to say "On the 
> > > > > producer side, we clearly preserve an order for the two 
> > > > > messages, <K1, V1> <K1,
> > > > > V2>"? IMHO, the semantics of the producer are clear that 
> > > > > having an observed
> > > > > order of sending records from different producers is not 
> > > > > sufficient to
> > > > guarantee ordering on the broker. You really need to send the 
> > > > 2nd record only after the 1st record is acked. It's the 
> > > > difficulty of achieving that in practice that's the true motivation for your KIP.
> > > > >
> > > > > I can see the attraction of using timestamps, but it would be 
> > > > > helpful to
> > > > explain how that really solves the problem. When the producers 
> > > > are in different processes on different machines you're relying 
> > > > on their clocks being synchronized, which is a whole subject in 
> > > > itself. Even if they're synchronized the resolution of
> > > > System.currentTimeMillis() is typically many milliseconds. If 
> > > > your producers are in different threads of the same process that 
> > > > could be a real problem because it
> > > makes ties quite likely.
> > > > > And you don't explain why it's OK to resolve ties using the offset.
> > > > > The
> > > > basis of your argument is that the offset is giving you the 
> > > > wrong
> > answer.
> > > > > So it seems to me that using it as a tiebreaker is just 
> > > > > narrowing the
> > > > chances of getting the wrong answer. Maybe none of this matters 
> > > > for your use case, but I think it should be spelled out in the 
> > > > KIP, because it surely would matter for similar use cases.
> > > > >
> > > > > Using a sequence at least removes the problem of ties, but the
> > > > interesting bit is now in how you deal with races between 
> > > > threads/processes in getting a sequence number allocated (which 
> > > > is out of scope of the KIP, I guess).
> > > > > How is resolving that race any simpler than resolving the
> > > > > motivating
> > > > race by waiting for the ack of the first record sent?
> > > > >
> > > > > Kind regards,
> > > > >
> > > > > Tom
> > > > >
> > > > > On Mon, Oct 21, 2019 at 9:06 PM Senthilnathan Muthusamy <
> > > > senthilm@microsoft.com.invalid> wrote:
> > > > >
> > > > >> Hi All,
> > > > >>
> > > > >> We are bring back the KIP-280 to live with small correct for 
> > > > >> the discussion & voting. Thanks to previous author Luis 
> > > > >> Cabral on the
> > > > >> KIP-280 initiation and we are taking over to complete and get 
> > > > >> it into
> > > > 2.4...
> > > > >>
> > > > >> Below is the correction that we made to the existing KIP-280:
> > > > >>
> > > > >>   *   Allowing the compact strategy configuration at the topic
> level
> > > as
> > > > >> the log compaction is at the topic level and a broker can 
> > > > >> have multiple topics. This allows the flexibility to have the 
> > > > >> strategy at both broker level (i.e. for all topics within the
> > > > >> broker) and topic level (i.e. for a subset of topics within a
> broker) as well...
> > > > >>
> > > > >> KIP-280:
> > > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction
> > > > >> PULL REQUEST: https://github.com/apache/kafka/pull/7528 (unit test
> > > > >> coverage in progress)
> > > > >>
> > > > >> Previous Thread DISCUSS:
> > > > >> https://lists.apache.org/thread.html/79aa6e50d7c737ddf83455dd8063692a535a1afa558620fe1a1496d3@%3Cdev.kafka.apache.org%3E
> > > > >> Previous Thread VOTE:
> > > > >> https://lists.apache.org/thread.html/b2ecd73ce849741f0c40b4f801c3f7650583497812713e240e1ac2b7@%3Cdev.kafka.apache.org%3E
> > > > >>
> > > > >> Appreciate your timely action.
> > > > >>
> > > > >> PS: Initiating a separate thread as I was not able to reply 
> > > > >> to the existing threads...
> > > > >>
> > > > >> Thanks,
> > > > >> Senthil
> > > > >>
> > > >
> > > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>
>
> --
> -- Guozhang
>

Re: [DISCUSS] KIP-280: Enhanced log compaction

Posted by Jun Rao <ju...@confluent.io>.
Hi, Senthil,

Sorry for the delay.

51. It seems that we can just remove the last record from the batch, but
keep the batch during compaction. The batch level metadata is enough to
preserve the log end offset.

53. Yes, your understanding is correct. So we could recommend users to set "
max.compaction.lag.ms" properly if they care about deletes.

Could you add both to the KIP?

Thanks,

Jun


On Tue, Nov 26, 2019 at 5:09 AM Senthilnathan Muthusamy
<se...@microsoft.com.invalid> wrote:

> Hi Gouzhang & Jun,
>
> Can one of you please confirm/respond to the below mail so that I will go
> ahead and update the KIP and proceed.
>
> Thanks
> Senthil
>
> - Senthil
> ________________________________
> From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> Sent: Wednesday, November 20, 2019 5:04:20 PM
> To: dev@kafka.apache.org <de...@kafka.apache.org>
> Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
>
> <merging threads>
>
> Hi Gouzhang & Jun,
>
> Thanks for the details on the scenarios.
>
> #51 => thanks for the details, Gouzhang, and the example. Won't the followers
> be sync'ing the LEO with the leader as well? If yes, always keeping the last
> record (without compaction, for the non-offset strategies) would work, and
> this is needed only if the new strategy ends up removing the LEO record,
> right? Also, I wasn't able to retrieve Jason's mail related to creating an
> empty message... Can you please forward it if you have it? I'm wondering how
> that can solve this particular issue unless we create a record for a random
> key that won't conflict with the producer/consumer keys for that topic/partition.
>
> #53 => I see that this can happen when a low produce rate keeps a segment
> ineligible for compaction for an unbounded duration, whereby
> "delete.retention.ms" triggers and removes the tombstone record. If
> that's the case (please correct me if I am missing any other scenarios),
> then we can suggest that Kafka users set "segment.ms" &
> "max.compaction.lag.ms" (as compaction won't happen on the active segment) to
> be smaller than "delete.retention.ms", and that should address this
> scenario, right?
>
> Thanks,
> Senthil
>
> -----Original Message-----
> From: Jun Rao <ju...@confluent.io>
> Sent: Wednesday, November 13, 2019 9:31 AM
> To: dev <de...@kafka.apache.org>
> Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
>
> Hi, Seth,
>
> 51. The difference is that with the offset compaction strategy, the
> message corresponding to the last offset is always the winning record and
> will never be removed. But with the new strategies, it's now possible that
> the message corresponding to the last offset is a losing record and needs
> to be removed.
>
> 53. Similarly, with the offset compaction strategy, if we see a
> non-tombstone record after a tombstone record, the non-tombstone record is
> always the winning one. However, with the new strategies, that
> non-tombstone record with a larger offset could be a losing record. The
> question is then how do we retain the tombstone long enough so that we
> could still recognize that the non-tombstone record should be ignored.
>
> Thanks,
>
> Jun
>
> -----Original Message-----
> From: Guozhang Wang <wa...@gmail.com>
> Sent: Tuesday, November 12, 2019 6:09 PM
> To: dev <de...@kafka.apache.org>
> Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
>
> Hello Senthil,
>
> Let me try to re-iterate on Jun's comments with some context here:
>
> 51: today with the offset-only compaction strategy, the last record of the
> log (we call it the log-end-record, whose offset is log-end-offset) would
> always be preserved and not compacted. This is kinda important for
> replication since followers reason about the log-end-offset on the leader.
> Consider this case: three replicas of a partition, leader 1 and follower 2
> and 3.
>
> Leader 1 has records a, b, c, d and d is the current last record of the
> partition, the current log-end-offset is 3 (assuming record a's offset is
> 0).
> Follower 2 has replicated a, b, c, d. Log-end-offset is 3 Follower 3 has
> replicated a, b, c but not yet replicated d. Log-end-offset is 2.
>
> NOTE that compaction triggering is independent on each broker; it is
> possible that leader 1 triggers compaction and deletes record d, while
> other followers have not triggered compaction yet. At this moment the
> leader's log becomes a, b, c. Now let's say follower 3 fetches from the leader
> after the compaction; it will no longer see record d.
>
> Now suppose there's a leader migration and follower 3 becomes the new
> leader, it would accept new appends (say, it's e), and record e would be
> appended at *offset 3 *on new leader 3's log. But follower 2's offset 3's
> record is d still. Later let's say follower 2 also triggers compaction and
> also fetches the new record e from new leader 3:
>
> Follower 2's log would be* a(0), b(1), c(2), e(4)* where the numbers in
> brackets are offset number; while leader 3's log would be *a(0), b(1),
> c(2), e(3)*. Now you see the two logs diverge in offsets, although their
> log entries are the same.
>
> -------------------------------------
>
> One way to resolve this is to simply never remove the last message during
> compaction. Another way (suggested by Jason in the old VOTE thread) is to
> create an empty message batch to "take up" that offset slot.
>
>
> 53: Again here's some context on when we can delete a tombstone (null):
> during compaction, if we see the latest record for a certain key is a
> tombstone we can remove all old records BUT that tombstone itself cannot be
> removed immediately since the old records may already be fetched by some
> consumers and that tombstone may not be fetched by consumer yet. Also that
> tombstone may have not been replicated to all other followers yet while the
> old records have already been replicated. Hence we have some config on the
> broker to "delay" the removal of the tombstone itself. You can find this
> config named "delete.retention.ms" in
>
> https://kafka.apache.org/documentation/#brokerconfigs
>
> Now consider the timestamp / header based compaction strategy: a later
> record may still be deprecated by an early tombstone, so if that tombstone
> is already removed then the log compaction thread would not remove that
> later record and hence the logic would be broken. That's why we also need
> to consider "delaying" the removal of the tombstone in this case.
>
> Personally I think we can still piggy-back on the "delete.retention.ms"
> since its default value is 86400000ms == 1 day, and we just need to
> document that if you have timestamp / header based compaction, then it's
> YOUR responsibility as the Kafka user to make sure that the timestamp /
> header out of ordering is smaller than the value of "delete.retention.ms".
> Otherwise some later records with smaller timestamp / headers may not be
> compacted correctly since the tombstone is already gone and hence we do not
> have the "proof" to remove it anymore.
>
>
> Does that make sense to you?
>
> Guozhang
>
>
> On Tue, Nov 12, 2019 at 9:15 AM Senthilnathan Muthusamy <
> senthilm@microsoft.com.invalid> wrote:
>
> > Hi Jun,
> >
> > Thanks for the response and please find below the response!
> >
> > #50 - got it...
> >
> > #51 - not sure how the last record will be deleted bcoz of this new
> > compact strategy. The reason I am asking is, the compaction is based
> > out of offsetmap and the new strategy logic is purely within the
> > offsetmap... the offsetmap will always keep track of the latest offset
> > irrespective of the compaction strategy. You can have a look at the PR
> > of the new compaction strategy changes:
> > https://github.com/apache/kafka/pull/7528/files
> >
> > #52 - sure, I have updated JIRA to include this details in the wiki.
> >
> > #53 - as I am pointed out in #51, the tombstone is abstract to this
> > change (i.e. the tombstone is handled within LogCleaner and the
> > compact strategy is by the offsetmap). this is what my understand on
> > the tombstone based on the code walk-thru... please let me know if I am
> missing anything here...
> >
> > Thanks,
> > Senthil
> >
> > -----Original Message-----
> > From: Jun Rao <ju...@confluent.io>
> > Sent: Thursday, November 7, 2019 4:32 PM
> > To: dev <de...@kafka.apache.org>
> > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> >
> > Hi, Senthil,
> >
> > Thanks for bringing back this KIP. Overall, this seems like a useful
> > feature. A few comments below.
> >
> > 50. One use case for the timestamp based compaction is to resolve
> > conflicts during data center failures. The failover of a data center
> > typically takes much longer than a millisecond. So, timestamp could be
> > enough to determine the value to keep.
> >
> > 51. With the timestamp/header strategy, it seems that it may now be
> > possible that the last record could be removed during compaction. For
> > example, if the active segment is empty, the last record in the
> > previous segment could be removed due to compaction. A new replica
> > then won't see the true end offset of the partition. If that replica
> > ever becomes the leader, it could write a different record on the same
> > end offset, which will be weird.
> >
> > 52. With the timestamp/header strategy, the behavior of the
> > application may need to change. In particular, the application can't
> > just blindly take the record with a larger offset and assuming that it's
> the value to keep.
> > It needs to check the timestamp or the header now. So, it would be
> > useful to at least document this.
> >
> > 53. This also adds complexity for deletes. Currently, we use a null
> > payload to indicate a delete tombstone. The tombstone can be removed
> > once all previous records with the same key have been removed. If the
> > new strategies apply to tombstones, it's not clear when a tombstone
> > can be removed since subsequent records could have
> > timestamp/sequenceId smaller than that in the tombstone. It would be
> > useful to think this through and document the expected behavior.
> >
> > Jun
> >
> > On Tue, Nov 5, 2019 at 11:37 AM Senthilnathan Muthusamy <
> > senthilm@microsoft.com.invalid> wrote:
> >
> > > Hi Guozhang,
> > >
> > > Sure and I have made a note in the JIRA item to make sure the wiki
> > > is updated.
> > >
> > > Thanks,
> > > Senthil
> > >
> > > -----Original Message-----
> > > From: Guozhang Wang <wa...@gmail.com>
> > > Sent: Monday, November 4, 2019 11:00 AM
> > > To: dev <de...@kafka.apache.org>
> > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > >
> > > Hello Senthilnathan,
> > >
> > > Thanks for revamping on the KIP. I have only one comment about the
> > > wiki otherwise LGTM.
> > >
> > > 1. We should emphasize that the newly introduced config yields to
> > > the existing "log.cleanup.policy", i.e. if the latter's value is
> > > `delete` not `compact`, then the previous config would be ignored.
> > >
> > >
> > > Guozhang
> > >
> > > On Mon, Nov 4, 2019 at 9:52 AM Senthilnathan Muthusamy <
> > > senthilm@microsoft.com.invalid> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I will start the vote thread shortly for this updated KIP. If
> > > > there are any more thoughts I would love to hear them.
> > > >
> > > > Thanks,
> > > > Senthil
> > > >
> > > > -----Original Message-----
> > > > From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> > > > Sent: Thursday, October 31, 2019 3:51 AM
> > > > To: dev@kafka.apache.org
> > > > Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> > > >
> > > > Hi Matthias
> > > >
> > > > Thanks for the response.
> > > >
> > > > (1) Yes
> > > >
> > > > (2) Yes, and the config name will be the same (i.e.
> > > > `log.cleaner.compaction.strategy` &
> > > > `log.cleaner.compaction.strategy.header`) at broker level and
> > > > topic level (to override broker level default compact strategy).
> > > > Please let me know if we need to keep it in different naming
> convention. Note:
> > > > Broker level (which will be in the server.properties)
> > > > configuration is optional and default it to offset. Topic level
> > > > configuration will be default to broker level config...
> > > >
> > > > (3) By this new way, it avoids another config parameter and also
> > > > in feature if any new strategy like header need addition info, no
> > > > additional config required. As this got discussed already and
> > > > agreed to have separate config, I will revert it. KIP updated...
> > > >
> > > > (4) Done
> > > >
> > > > (5) Updated
> > > >
> > > > (6) Updated to pick the first header in the list
> > > >
> > > > Please let me know if you have any other questions.
> > > >
> > > > Thanks,
> > > > Senthil
> > > >
> > > > -----Original Message-----
> > > > From: Matthias J. Sax <ma...@confluent.io>
> > > > Sent: Thursday, October 31, 2019 12:13 AM
> > > > To: dev@kafka.apache.org
> > > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > > >
> > > > Thanks for picking up this KIP, Senthil.
> > > >
> > > > (1) As far as I remember, the main issue of the original proposal
> > > > was a missing topic level configuration for the compaction strategy.
> > > > With this being addressed, I am in favor of this KIP.
> > > >
> > > > (2) With regard to (1), it seems we would need a new topic level
> > > > config `compaction.strategy`, and
> > > > `log.cleaner.compaction.strategy` would be the default strategy
> > > > (ie, broker level config) if a topic does
> > > not overwrite it?
> > > >
> > > > (3) Why did you remove `log.cleaner.compaction.strategy.header`
> > > > parameter and change the accepted values of
> > > > `log.cleaner.compaction.strategy` to "header.<key>" instead of
> > > > keeping "header"? The original approach seems to be cleaner, and I
> > > > think this was discussed on the original discuss thread already.
> > > >
> > > > (4) Nit: For the "timestamp" compaction strategy you changed the
> > > > KIP to
> > > >
> > > > -> `The record [create] timestamp`
> > > >
> > > > This is misleading IMHO, because it depends on the broker/log
> > > > configuration `(log.)message.timestamp.type` that can either be
> > > > `CreateTime` or `LogAppendTime` what the actual record timestamp is.
> > > > I would just remove "create" to keep it unspecified.
> > > >
> > > > (5) Nit: the section "Public Interfaces" should list the newly
> > > > introduced configs -- configuration parameters are a public
> interface.
> > > >
> > > > (6) What do you mean by "first level header lookup"? The term
> > > > "first level" indicates some hierarchy, but headers don't have any
> > > > hierarchy
> > > > -- it's just a list of key-value pairs? If you mean the _order_ of
> > > > the headers, ie, pick the first header in the list that matches
> > > > the key, please rephrase it to make it clearer.
> > > >
> > > >
> > > >
> > > > @Tom: I agree with all you are saying, however, I still think that
> > > > this KIP will improve the overall situation, because everything
> > > > you pointed out is actually true with offset based compaction, too.
> > > >
> > > > The KIP is not a silver bullet that solves all issue for
> > > > interleaved writes, but I personally believe, it's a good
> improvement.
> > > >
> > > >
> > > >
> > > > -Matthias
> > > >
> > > >
> > > > On 10/30/19 9:45 AM, Senthilnathan Muthusamy wrote:
> > > > > Hi,
> > > > >
> > > > > Please let me know if anyone has any questions on this updated
> > > KIP-280...
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Senthil
> > > > >
> > > > > -----Original Message-----
> > > > > From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> > > > > Sent: Monday, October 28, 2019 11:36 PM
> > > > > To: dev@kafka.apache.org
> > > > > Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> > > > >
> > > > > Hi Tom,
> > > > >
> > > > > Sorry for the delayed response.
> > > > >
> > > > > Regarding the fall back to offset decision for both timestamp &
> > > > > header
> > > > value is based on the previous author discuss
> > > > https://lists.apache.org/thread.html/f44317eb6cd34f91966654c80509d4a457dbbccdd02b86645782be67@%3Cdev.kafka.apache.org%3E
> > > > and as per the discussion, it is really required to avoid duplicates.
> > > > >
> > > > > And the timestamp strategy is from the original KIP author and
> > > > > we are
> > > > keeping it as is.
> > > > >
> > > > > Finally on the sequence order guarantee by the producer, it is
> > > > > not
> > > > feasible on waiting for ack in async / multi-threads/processes
> > > > scenarios and hence the header sequence based compact strategy
> > > > with producer's responsibility to have a unique sequence
> > > > generation for the topic-partition-key level.
> > > > >
> > > > > Hoping this clarifies all your questions. Please let us know if
> > > > > you have
> > > > any further questions.
> > > > >
> > > > > @Guozhang Wang / @Matthias J. Sax, I see you both had a detail
> > > > discussion on the original KIP with previous author and it would
> > > > great to hear your inputs as well.
> > > > >
> > > > > Thanks,
> > > > > Senthil
> > > > >
> > > > > -----Original Message-----
> > > > > From: Tom Bentley <tb...@redhat.com>
> > > > > Sent: Tuesday, October 22, 2019 2:32 AM
> > > > > To: dev@kafka.apache.org
> > > > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > > > >
> > > > > Hi Senthilnathan,
> > > > >
> > > > > In the motivation isn't it a little misleading to say "On the
> > > > > producer side, we clearly preserve an order for the two
> > > > > messages, <K1, V1> <K1,
> > > > > V2>"? IMHO, the semantics of the producer are clear that having
> > > > > V2>an observed
> > > > > order of sending records from different producers is not
> > > > > sufficient to
> > > > guarantee ordering on the broker. You really need to send the 2nd
> > > record only after the 1st record is acked. It's the difficulty of
> > > > achieving that in practice that's the true motivation for your KIP.
> > > > >
> > > > > I can see the attraction of using timestamps, but it would be
> > > > > helpful to
> > > > explain how that really solves the problem. When the producers are
> > > > in different processes on different machines you're relying on
> > > > their clocks being synchronized, which is a whole subject in
> > > > itself. Even if they're synchronized the resolution of
> > > > System.currentTimeMillis() is typically many milliseconds. If your
> > > > producers are in different threads of the same process that could
> > > > be a real problem because it
> > > makes ties quite likely.
> > > > > And you don't explain why it's OK to resolve ties using the offset.
> > > > > The
> > > > basis of your argument is that the offset is giving you the wrong
> > answer.
> > > > > So it seems to me that using it as a tiebreaker is just
> > > > > narrowing the
> > > > chances of getting the wrong answer. Maybe none of this matters
> > > > for your use case, but I think it should be spelled out in the
> > > > KIP, because it surely would matter for similar use cases.
> > > > >
> > > > > Using a sequence at least removes the problem of ties, but the
> > > > interesting bit is now in how you deal with races between
> > > > threads/processes in getting a sequence number allocated (which is
> > > > out of scope of the KIP, I guess).
> > > > > How is resolving that race any simpler that resolving the
> > > > > motivating
> > > > race by waiting for the ack of the first record sent?
> > > > >
> > > > > Kind regards,
> > > > >
> > > > > Tom
> > > > >
> > > > > On Mon, Oct 21, 2019 at 9:06 PM Senthilnathan Muthusamy <
> > > > senthilm@microsoft.com.invalid> wrote:
> > > > >
> > > > >> Hi All,
> > > > >>
> > > > >> We are bring back the KIP-280 to live with small correct for
> > > > >> the discussion & voting. Thanks to previous author Luis Cabral
> > > > >> on the
> > > > >> KIP-280 initiation and we are taking over to complete and get
> > > > >> it into
> > > > 2.4...
> > > > >>
> > > > >> Below is the correction that we made to the existing KIP-280:
> > > > >>
> > > > >>   *   Allowing the compact strategy configuration at the topic
> level
> > > as
> > > > >> the log compaction is at the topic level and a broker can have
> > > > >> multiple topics. This allows the flexibility to have the
> > > > >> strategy at both broker level (i.e. for all topics within the
> > > > >> broker) and topic level (i.e. for a subset of topics within a
> broker) as well...
> > > > >>
> > > > >> KIP-280:
> > > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction
> > > > >> PULL REQUEST: https://github.com/apache/kafka/pull/7528 (unit test coverage in progress)
> > > > >>
> > > > >> Previous Thread DISCUSS:
> > > > >> https://lists.apache.org/thread.html/79aa6e50d7c737ddf83455dd8063692a535a1afa558620fe1a1496d3@%3Cdev.kafka.apache.org%3E
> > > > >> Previous Thread VOTE:
> > > > >> https://lists.apache.org/thread.html/b2ecd73ce849741f0c40b4f801c3f7650583497812713e240e1ac2b7@%3Cdev.kafka.apache.org%3E
> > > > >>
> > > > >> Appreciate your timely action.
> > > > >>
> > > > >> PS: Initiating a separate thread as I was not able to reply to
> > > > >> the existing threads...
> > > > >>
> > > > >> Thanks,
> > > > >> Senthil
> > > > >>
> > > >
> > > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>
>
> --
> -- Guozhang
>

Re: [DISCUSS] KIP-280: Enhanced log compaction

Posted by Senthilnathan Muthusamy <se...@microsoft.com.INVALID>.
Hi Guozhang & Jun,

Can one of you please confirm/respond to the mail below so that I can go ahead, update the KIP, and proceed?

Thanks
Senthil

- Senthil
________________________________
From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
Sent: Wednesday, November 20, 2019 5:04:20 PM
To: dev@kafka.apache.org <de...@kafka.apache.org>
Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction

<merging threads>

Hi Guozhang & Jun,

Thanks for the detailed walkthrough of the scenarios.

#51 => thanks for the detailed example, Guozhang. Won't the followers also sync the LEO with the leader? If yes, always keeping the last record (i.e. never compacting it under the non-offset strategies) would work, and it is only needed when the new strategy would otherwise remove the LEO record, right? Also, I wasn't able to retrieve Jason's mail about creating an empty message batch... could you please forward it if you have it? I'm wondering how that can solve this particular issue unless we create a record for a random key that won't conflict with the producer/consumer keys for that topic/partition.

#53 => I see that this can happen when a low produce rate keeps the tombstone ineligible for compaction for an unbounded duration, by which time "delete.retention.ms" triggers and removes the tombstone record. If that's the case (please correct me if I am missing any other scenarios), then we can suggest that Kafka users set "segment.ms" and "max.compaction.lag.ms" (since compaction won't happen on the active segment) to be smaller than "delete.retention.ms", and that should address this scenario, right?

Thanks,
Senthil

-----Original Message-----
From: Jun Rao <ju...@confluent.io>
Sent: Wednesday, November 13, 2019 9:31 AM
To: dev <de...@kafka.apache.org>
Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction

Hi, Senthil,

51. The difference is that with the offset compaction strategy, the message corresponding to the last offset is always the winning record and will never be removed. But with the new strategies, it's now possible that the message corresponding to the last offset is a losing record and needs to be removed.

53. Similarly, with the offset compaction strategy, if we see a non-tombstone record after a tombstone record, the non-tombstone record is always the winning one. However, with the new strategies, that non-tombstone record with a larger offset could be a losing record. The question is then how do we retain the tombstone long enough so that we could still recognize that the non-tombstone record should be ignored.

Thanks,

Jun

-----Original Message-----
From: Guozhang Wang <wa...@gmail.com>
Sent: Tuesday, November 12, 2019 6:09 PM
To: dev <de...@kafka.apache.org>
Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction

Hello Senthil,

Let me try to re-iterate on Jun's comments with some context here:

51: today, with the offset-only compaction strategy, the last record of the log (we call it the log-end-record, whose offset is the log-end-offset) is always preserved and never compacted away. This is quite important for replication since followers reason about the log-end-offset on the leader.
Consider this case: three replicas of a partition, leader 1 and followers 2 and 3.

Leader 1 has records a, b, c, d, where d is the current last record of the partition; the current log-end-offset is 3 (assuming record a's offset is 0).
Follower 2 has replicated a, b, c, d; its log-end-offset is 3. Follower 3 has replicated a, b, c but not yet d; its log-end-offset is 2.

NOTE that compaction is triggered independently on each broker, so it is possible that leader 1 triggers compaction and deletes record d while the other followers have not triggered compaction yet. At this moment the leader's log becomes a, b, c. Now, if follower 3 fetches from the leader after the compaction, it will no longer see record d.

Now suppose there's a leader migration and follower 3 becomes the new leader. It would accept new appends (say, record e), and e would be appended at offset 3 on new leader 3's log. But on follower 2, offset 3 still holds record d. Later, follower 2 also triggers compaction and fetches the new record e from new leader 3:

Follower 2's log would be a(0), b(1), c(2), e(4), where the numbers in brackets are offsets, while leader 3's log would be a(0), b(1), c(2), e(3). Now the two logs diverge in offsets, although their log entries are the same.

-------------------------------------

One way to resolve this, is to simply never remove the last message during compaction. Another way (suggested by Jason in the old VOTE thread) is to create an empty message batch to "take up" that offset slot.


53: Again, here's some context on when we can delete a tombstone (null):
during compaction, if we see that the latest record for a certain key is a tombstone, we can remove all older records, BUT the tombstone itself cannot be removed immediately, since the older records may already have been fetched by some consumers while the tombstone has not been fetched by them yet. Also, the tombstone may not have been replicated to all the other followers yet while the older records have already been replicated. Hence we have a config on the broker to "delay" the removal of the tombstone itself. You can find this config, named "delete.retention.ms", in
https://kafka.apache.org/documentation/#brokerconfigs

Now consider the timestamp / header based compaction strategies: a later record may still be superseded by an earlier tombstone, so if that tombstone has already been removed then the log compaction thread would not remove that later record, and hence the logic would be broken. That's why we also need to consider "delaying" the removal of the tombstone in this case.

Personally I think we can still piggy-back on "delete.retention.ms",
since its default value is 86400000ms == 1 day, and we just need to document that if you have timestamp / header based compaction, then it's YOUR responsibility as the Kafka user to make sure that the timestamp / header out-of-order window is smaller than the value of "delete.retention.ms".
Otherwise some later records with smaller timestamps / headers may not be compacted correctly, since the tombstone is already gone and hence we no longer have the "proof" to remove them.


Does that make sense to you?

Guozhang


On Tue, Nov 12, 2019 at 9:15 AM Senthilnathan Muthusamy <se...@microsoft.com.invalid> wrote:

> Hi Jun,
>
> Thanks for the response and please find below the response!
>
> #50 - got it...
>
> #51 - not sure how the last record will be deleted bcoz of this new
> compact strategy. The reason I am asking is, the compaction is based
> out of offsetmap and the new strategy logic is purely within the
> offsetmap... the offsetmap will always keep track of the latest offset
> irrespective of the compaction strategy. You can have a look at the PR
> of the new compaction strategy changes:
> https://github.com/apache/kafka/pull/7528/files
>
> #52 - sure, I have updated JIRA to include this details in the wiki.
>
> #53 - as I am pointed out in #51, the tombstone is abstract to this
> change (i.e. the tombstone is handled within LogCleaner and the
> compact strategy is by the offsetmap). this is what my understand on
> the tombstone based on the code walk-thru... please let me know if I am missing anything here...
>
> Thanks,
> Senthil
>
> -----Original Message-----
> From: Jun Rao <ju...@confluent.io>
> Sent: Thursday, November 7, 2019 4:32 PM
> To: dev <de...@kafka.apache.org>
> Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
>
> Hi, Senthil,
>
> Thanks for bringing back this KIP. Overall, this seems like a useful
> feature. A few comments below.
>
> 50. One use case for the timestamp based compaction is to resolve
> conflicts during data center failures. The failover of a data center
> typically takes much longer than a millisecond. So, timestamp could be
> enough to determine the value to keep.
>
> 51. With the timestamp/header strategy, it seems that it may now be
> possible that the last record could be removed during compaction. For
> example, if the active segment is empty, the last record in the
> previous segment could be removed due to compaction. A new replica
> then won't see the true end offset of the partition. If that replica
> ever becomes the leader, it could write a different record on the same
> end offset, which will be weird.
>
> 52. With the timestamp/header strategy, the behavior of the
> application may need to change. In particular, the application can't
> just blindly take the record with a larger offset and assuming that it's the value to keep.
> It needs to check the timestamp or the header now. So, it would be
> useful to at least document this.
>
> 53. This also adds complexity for deletes. Currently, we use a null
> payload to indicate a delete tombstone. The tombstone can be removed
> once all previous records with the same key have been removed. If the
> new strategies apply to tombstones, it's not clear when a tombstone
> can be removed since subsequent records could have
> timestamp/sequenceId smaller than that in the tombstone. It would be
> useful to think this through and document the expected behavior.
>
> Jun
>
> On Tue, Nov 5, 2019 at 11:37 AM Senthilnathan Muthusamy <
> senthilm@microsoft.com.invalid> wrote:
>
> > Hi Guozhang,
> >
> > Sure and I have made a note in the JIRA item to make sure the wiki
> > is updated.
> >
> > Thanks,
> > Senthil
> >
> > -----Original Message-----
> > From: Guozhang Wang <wa...@gmail.com>
> > Sent: Monday, November 4, 2019 11:00 AM
> > To: dev <de...@kafka.apache.org>
> > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> >
> > Hello Senthilnathan,
> >
> > Thanks for revamping on the KIP. I have only one comment about the
> > wiki otherwise LGTM.
> >
> > 1. We should emphasize that the newly introduced config yields to
> > the existing "log.cleanup.policy", i.e. if the latter's value is
> > `delete` not `compact`, then the previous config would be ignored.
> >
> >
> > Guozhang
> >
> > On Mon, Nov 4, 2019 at 9:52 AM Senthilnathan Muthusamy <
> > senthilm@microsoft.com.invalid> wrote:
> >
> > > Hi all,
> > >
> > > I will start the vote thread shortly for this updated KIP. If
> > > there are any more thoughts I would love to hear them.
> > >
> > > Thanks,
> > > Senthil
> > >
> > > -----Original Message-----
> > > From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> > > Sent: Thursday, October 31, 2019 3:51 AM
> > > To: dev@kafka.apache.org
> > > Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> > >
> > > Hi Matthias
> > >
> > > Thanks for the response.
> > >
> > > (1) Yes
> > >
> > > (2) Yes, and the config name will be the same (i.e.
> > > `log.cleaner.compaction.strategy` &
> > > `log.cleaner.compaction.strategy.header`) at broker level and
> > > topic level (to override broker level default compact strategy).
> > > Please let me know if we need to keep it in different naming convention. Note:
> > > Broker level (which will be in the server.properties)
> > > configuration is optional and default it to offset. Topic level
> > > configuration will be default to broker level config...
> > >
> > > (3) By this new way, it avoids another config parameter and also
> > > in feature if any new strategy like header need addition info, no
> > > additional config required. As this got discussed already and
> > > agreed to have separate config, I will revert it. KIP updated...
> > >
> > > (4) Done
> > >
> > > (5) Updated
> > >
> > > (6) Updated to pick the first header in the list
> > >
> > > Please let me know if you have any other questions.
> > >
> > > Thanks,
> > > Senthil
> > >
> > > -----Original Message-----
> > > From: Matthias J. Sax <ma...@confluent.io>
> > > Sent: Thursday, October 31, 2019 12:13 AM
> > > To: dev@kafka.apache.org
> > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > >
> > > Thanks for picking up this KIP, Senthil.
> > >
> > > (1) As far as I remember, the main issue of the original proposal
> > > was a missing topic level configuration for the compaction strategy.
> > > With this being addressed, I am in favor of this KIP.
> > >
> > > (2) With regard to (1), it seems we would need a new topic level
> > > config `compaction.strategy`, and
> > > `log.cleaner.compaction.strategy` would be the default strategy
> > > (ie, broker level config) if a topic does
> > not overwrite it?
> > >
> > > (3) Why did you remove `log.cleaner.compaction.strategy.header`
> > > parameter and change the accepted values of
> > > `log.cleaner.compaction.strategy` to "header.<key>" instead of
> > > keeping "header"? The original approach seems to be cleaner, and I
> > > think this was discussed on the original discuss thread already.
> > >
> > > (4) Nit: For the "timestamp" compaction strategy you changed the
> > > KIP to
> > >
> > > -> `The record [create] timestamp`
> > >
> > > This is misleading IMHO, because it depends on the broker/log
> > > configuration `(log.)message.timestamp.type` that can either be
> > > `CreateTime` or `LogAppendTime` what the actual record timestamp is.
> > > I would just remove "create" to keep it unspecified.
> > >
> > > (5) Nit: the section "Public Interfaces" should list the newly
> > > introduced configs -- configuration parameters are a public interface.
> > >
> > > (6) What do you mean by "first level header lookup"? The term
> > > "first level" indicates some hierarchy, but headers don't have any
> > > hierarchy
> > > -- it's just a list of key-value pairs? If you mean the _order_ of
> > > the headers, ie, pick the first header in the list that matches
> > > the key, please rephrase it to make it clearer.
> > >
> > >
> > >
> > > @Tom: I agree with all you are saying, however, I still think that
> > > this KIP will improve the overall situation, because everything
> > > you pointed out is actually true with offset based compaction, too.
> > >
> > > The KIP is not a silver bullet that solves all issue for
> > > interleaved writes, but I personally believe, it's a good improvement.
> > >
> > >
> > >
> > > -Matthias
> > >
> > >
> > > On 10/30/19 9:45 AM, Senthilnathan Muthusamy wrote:
> > > > Hi,
> > > >
> > > > Please let me know if anyone has any questions on this updated
> > KIP-280...
> > > >
> > > > Thanks,
> > > >
> > > > Senthil
> > > >
> > > > -----Original Message-----
> > > > From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> > > > Sent: Monday, October 28, 2019 11:36 PM
> > > > To: dev@kafka.apache.org
> > > > Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> > > >
> > > > Hi Tom,
> > > >
> > > > Sorry for the delayed response.
> > > >
> > > > Regarding the fall back to offset decision for both timestamp &
> > > > header
> > > value is based on the previous author discuss
> > > https://lists.apache.org/thread.html/f44317eb6cd34f91966654c80509d4a457dbbccdd02b86645782be67@%3Cdev.kafka.apache.org%3E
> > > and as per the discussion, it is really required to avoid duplicates.
> > > >
> > > > And the timestamp strategy is from the original KIP author and
> > > > we are
> > > keeping it as is.
> > > >
> > > > Finally on the sequence order guarantee by the producer, it is
> > > > not
> > > feasible on waiting for ack in async / multi-threads/processes
> > > scenarios and hence the header sequence based compact strategy
> > > with producer's responsibility to have a unique sequence
> > > generation for the topic-partition-key level.
> > > >
> > > > Hoping this clarifies all your questions. Please let us know if
> > > > you have
> > > any further questions.
> > > >
> > > > @Guozhang Wang / @Matthias J. Sax, I see you both had a detail
> > > discussion on the original KIP with previous author and it would
> > > great to hear your inputs as well.
> > > >
> > > > Thanks,
> > > > Senthil
> > > >
> > > > -----Original Message-----
> > > > From: Tom Bentley <tb...@redhat.com>
> > > > Sent: Tuesday, October 22, 2019 2:32 AM
> > > > To: dev@kafka.apache.org
> > > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > > >
> > > > Hi Senthilnathan,
> > > >
> > > > In the motivation isn't it a little misleading to say "On the
> > > > producer side, we clearly preserve an order for the two
> > > > messages, <K1, V1> <K1,
> > > > V2>"? IMHO, the semantics of the producer are clear that having
> > > > V2>an observed
> > > > order of sending records from different producers is not
> > > > sufficient to
> > > guarantee ordering on the broker. You really need to send the 2nd
> > > record only after the 1st record is acked. It's the difficulty of
> > > achieving that in practice that's the true motivation for your KIP.
> > > >
> > > > I can see the attraction of using timestamps, but it would be
> > > > helpful to
> > > explain how that really solves the problem. When the producers are
> > > in different processes on different machines you're relying on
> > > their clocks being synchronized, which is a whole subject in
> > > itself. Even if they're synchronized the resolution of
> > > System.currentTimeMillis() is typically many milliseconds. If your
> > > producers are in different threads of the same process that could
> > > be a real problem because it
> > makes ties quite likely.
> > > > And you don't explain why it's OK to resolve ties using the offset.
> > > > The
> > > basis of your argument is that the offset is giving you the wrong
> answer.
> > > > So it seems to me that using it as a tiebreaker is just
> > > > narrowing the
> > > chances of getting the wrong answer. Maybe none of this matters
> > > for your use case, but I think it should be spelled out in the
> > > KIP, because it surely would matter for similar use cases.
> > > >
> > > > Using a sequence at least removes the problem of ties, but the
> > > interesting bit is now in how you deal with races between
> > > threads/processes in getting a sequence number allocated (which is
> > > out of scope of the KIP, I guess).
> > > > How is resolving that race any simpler that resolving the
> > > > motivating
> > > race by waiting for the ack of the first record sent?
> > > >
> > > > Kind regards,
> > > >
> > > > Tom
> > > >
> > > > On Mon, Oct 21, 2019 at 9:06 PM Senthilnathan Muthusamy <
> > > senthilm@microsoft.com.invalid> wrote:
> > > >
> > > >> Hi All,
> > > >>
> > > >> We are bring back the KIP-280 to live with small correct for
> > > >> the discussion & voting. Thanks to previous author Luis Cabral
> > > >> on the
> > > >> KIP-280 initiation and we are taking over to complete and get
> > > >> it into
> > > 2.4...
> > > >>
> > > >> Below is the correction that we made to the existing KIP-280:
> > > >>
> > > >>   *   Allowing the compact strategy configuration at the topic level
> > as
> > > >> the log compaction is at the topic level and a broker can have
> > > >> multiple topics. This allows the flexibility to have the
> > > >> strategy at both broker level (i.e. for all topics within the
> > > >> broker) and topic level (i.e. for a subset of topics within a broker) as well...
> > > >>
> > > >> KIP-280:
> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction
> > > >> PULL REQUEST: https://github.com/apache/kafka/pull/7528 (unit test coverage in progress)
> > > >>
> > > >> Previous Thread DISCUSS:
> > > >> https://lists.apache.org/thread.html/79aa6e50d7c737ddf83455dd8063692a535a1afa558620fe1a1496d3@%3Cdev.kafka.apache.org%3E
> > > >> Previous Thread VOTE:
> > > >> https://lists.apache.org/thread.html/b2ecd73ce849741f0c40b4f801c3f7650583497812713e240e1ac2b7@%3Cdev.kafka.apache.org%3E
> > > >>
> > > >> Appreciate your timely action.
> > > >>
> > > >> PS: Initiating a separate thread as I was not able to reply to
> > > >> the existing threads...
> > > >>
> > > >> Thanks,
> > > >> Senthil
> > > >>
> > >
> > >
> >
> > --
> > -- Guozhang
> >
>


--
-- Guozhang

RE: [DISCUSS] KIP-280: Enhanced log compaction

Posted by Senthilnathan Muthusamy <se...@microsoft.com.INVALID>.
<merging threads>

Hi Guozhang & Jun,

Thanks for the detailed walkthrough of the scenarios.

#51 => thanks for the detailed example, Guozhang. Won't the followers also sync the LEO with the leader? If yes, always keeping the last record (i.e. never compacting it under the non-offset strategies) would work, and it is only needed when the new strategy would otherwise remove the LEO record, right? Also, I wasn't able to retrieve Jason's mail about creating an empty message batch... could you please forward it if you have it? I'm wondering how that can solve this particular issue unless we create a record for a random key that won't conflict with the producer/consumer keys for that topic/partition.
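
Just to make sure I understand #51, here is a minimal sketch (not the actual LogCleaner code; the names are made up for illustration) of what I mean by always keeping the log-end record under the non-offset strategies:

    // Hypothetical helper: `lastOffset` is assumed to be the offset of the
    // current log-end record. Under the timestamp/header strategies we would
    // force-retain that record even if the strategy would otherwise discard
    // it, so the log-end-offset never moves backwards after cleaning.
    def retainDuringCompaction(recordOffset: Long,
                               lastOffset: Long,
                               strategyWouldRetain: Boolean): Boolean =
      strategyWouldRetain || recordOffset == lastOffset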

#53 => I see that this can happen when a low produce rate keeps the tombstone ineligible for compaction for an unbounded duration, by which time "delete.retention.ms" triggers and removes the tombstone record. If that's the case (please correct me if I am missing any other scenarios), then we can suggest that Kafka users set "segment.ms" and "max.compaction.lag.ms" (since compaction won't happen on the active segment) to be smaller than "delete.retention.ms", and that should address this scenario, right?
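
And for #53, as an illustration only, that suggestion would translate into topic overrides roughly like the below (made-up example values, just to show the intended ordering of the configs; not a recommendation):

    # hypothetical topic-level overrides (illustrative values only)
    segment.ms=86400000              # roll the active segment at least daily
    max.compaction.lag.ms=172800000  # make records eligible for compaction within ~2 days
    delete.retention.ms=604800000    # keep tombstones ~7 days, larger than both values above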

Thanks,
Senthil

-----Original Message-----
From: Jun Rao <ju...@confluent.io> 
Sent: Wednesday, November 13, 2019 9:31 AM
To: dev <de...@kafka.apache.org>
Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction

Hi, Senthil,

51. The difference is that with the offset compaction strategy, the message corresponding to the last offset is always the winning record and will never be removed. But with the new strategies, it's now possible that the message corresponding to the last offset is a losing record and needs to be removed.

53. Similarly, with the offset compaction strategy, if we see a non-tombstone record after a tombstone record, the non-tombstone record is always the winning one. However, with the new strategies, that non-tombstone record with a larger offset could be a losing record. The question is then how do we retain the tombstone long enough so that we could still recognize that the non-tombstone record should be ignored.

Thanks,

Jun

-----Original Message-----
From: Guozhang Wang <wa...@gmail.com> 
Sent: Tuesday, November 12, 2019 6:09 PM
To: dev <de...@kafka.apache.org>
Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction

Hello Senthil,

Let me try to re-iterate on Jun's comments with some context here:

51: today, with the offset-only compaction strategy, the last record of the log (we call it the log-end-record, whose offset is the log-end-offset) is always preserved and never compacted away. This is quite important for replication since followers reason about the log-end-offset on the leader.
Consider this case: three replicas of a partition, leader 1 and followers 2 and 3.

Leader 1 has records a, b, c, d, where d is the current last record of the partition; the current log-end-offset is 3 (assuming record a's offset is 0).
Follower 2 has replicated a, b, c, d; its log-end-offset is 3. Follower 3 has replicated a, b, c but not yet d; its log-end-offset is 2.

NOTE that compaction is triggered independently on each broker, so it is possible that leader 1 triggers compaction and deletes record d while the other followers have not triggered compaction yet. At this moment the leader's log becomes a, b, c. Now, if follower 3 fetches from the leader after the compaction, it will no longer see record d.

Now suppose there's a leader migration and follower 3 becomes the new leader. It would accept new appends (say, record e), and e would be appended at offset 3 on new leader 3's log. But on follower 2, offset 3 still holds record d. Later, follower 2 also triggers compaction and fetches the new record e from new leader 3:

Follower 2's log would be a(0), b(1), c(2), e(4), where the numbers in brackets are offsets, while leader 3's log would be a(0), b(1), c(2), e(3). Now the two logs diverge in offsets, although their log entries are the same.

-------------------------------------

One way to resolve this, is to simply never remove the last message during compaction. Another way (suggested by Jason in the old VOTE thread) is to create an empty message batch to "take up" that offset slot.


53: Again, here's some context on when we can delete a tombstone (null):
during compaction, if we see that the latest record for a certain key is a tombstone, we can remove all older records, BUT the tombstone itself cannot be removed immediately, since the older records may already have been fetched by some consumers while the tombstone has not been fetched by them yet. Also, the tombstone may not have been replicated to all the other followers yet while the older records have already been replicated. Hence we have a config on the broker to "delay" the removal of the tombstone itself. You can find this config, named "delete.retention.ms", in
https://kafka.apache.org/documentation/#brokerconfigs

Now consider the timestamp / header based compaction strategies: a later record may still be superseded by an earlier tombstone, so if that tombstone has already been removed then the log compaction thread would not remove that later record, and hence the logic would be broken. That's why we also need to consider "delaying" the removal of the tombstone in this case.

Personally I think we can still piggy-back on "delete.retention.ms",
since its default value is 86400000ms == 1 day, and we just need to document that if you have timestamp / header based compaction, then it's YOUR responsibility as the Kafka user to make sure that the timestamp / header out-of-order window is smaller than the value of "delete.retention.ms".
Otherwise some later records with smaller timestamps / headers may not be compacted correctly, since the tombstone is already gone and hence we no longer have the "proof" to remove them.


Does that make sense to you?

Guozhang


On Tue, Nov 12, 2019 at 9:15 AM Senthilnathan Muthusamy <se...@microsoft.com.invalid> wrote:

> Hi Jun,
>
> Thanks for the response and please find below the response!
>
> #50 - got it...
>
> #51 - not sure how the last record will be deleted bcoz of this new 
> compact strategy. The reason I am asking is, the compaction is based 
> out of offsetmap and the new strategy logic is purely within the 
> offsetmap... the offsetmap will always keep track of the latest offset 
> irrespective of the compaction strategy. You can have a look at the PR 
> of the new compaction strategy changes: 
> https://github.com/apache/kafka/pull/7528/files
>
> #52 - sure, I have updated JIRA to include this details in the wiki.
>
> #53 - as I am pointed out in #51, the tombstone is abstract to this 
> change (i.e. the tombstone is handled within LogCleaner and the 
> compact strategy is by the offsetmap). this is what my understand on 
> the tombstone based on the code walk-thru... please let me know if I am missing anything here...
>
> Thanks,
> Senthil
>
> -----Original Message-----
> From: Jun Rao <ju...@confluent.io>
> Sent: Thursday, November 7, 2019 4:32 PM
> To: dev <de...@kafka.apache.org>
> Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
>
> Hi, Senthil,
>
> Thanks for bringing back this KIP. Overall, this seems like a useful 
> feature. A few comments below.
>
> 50. One use case for the timestamp based compaction is to resolve 
> conflicts during data center failures. The failover of a data center 
> typically takes much longer than a millisecond. So, timestamp could be
> enough to determine the value to keep.
>
> 51. With the timestamp/header strategy, it seems that it may now be 
> possible that the last record could be removed during compaction. For 
> example, if the active segment is empty, the last record in the 
> previous segment could be removed due to compaction. A new replica 
> then won't see the true end offset of the partition. If that replica 
> ever becomes the leader, it could write a different record on the same 
> end offset, which will be weird.
>
> 52. With the timestamp/header strategy, the behavior of the 
> application may need to change. In particular, the application can't 
> just blindly take the record with a larger offset and assuming that it's the value to keep.
> It needs to check the timestamp or the header now. So, it would be 
> useful to at least document this.
>
> 53. This also adds complexity for deletes. Currently, we use a null 
> payload to indicate a delete tombstone. The tombstone can be removed 
> once all previous records with the same key have been removed. If the 
> new strategies apply to tombstones, it's not clear when a tombstone 
> can be removed since subsequent records could have 
> timestamp/sequenceId smaller than that in the tombstone. It would be 
> useful to think this through and document the expected behavior.
>
> Jun
>
> On Tue, Nov 5, 2019 at 11:37 AM Senthilnathan Muthusamy < 
> senthilm@microsoft.com.invalid> wrote:
>
> > Hi Guozhang,
> >
> > Sure and I have made a note in the JIRA item to make sure the wiki 
> > is updated.
> >
> > Thanks,
> > Senthil
> >
> > -----Original Message-----
> > From: Guozhang Wang <wa...@gmail.com>
> > Sent: Monday, November 4, 2019 11:00 AM
> > To: dev <de...@kafka.apache.org>
> > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> >
> > Hello Senthilnathan,
> >
> > Thanks for revamping on the KIP. I have only one comment about the 
> > wiki otherwise LGTM.
> >
> > 1. We should emphasize that the newly introduced config yields to 
> > the existing "log.cleanup.policy", i.e. if the latter's value is 
> > `delete` not `compact`, then the previous config would be ignored.
> >
> >
> > Guozhang
> >
> > On Mon, Nov 4, 2019 at 9:52 AM Senthilnathan Muthusamy < 
> > senthilm@microsoft.com.invalid> wrote:
> >
> > > Hi all,
> > >
> > > I will start the vote thread shortly for this updated KIP. If 
> > > there are any more thoughts I would love to hear them.
> > >
> > > Thanks,
> > > Senthil
> > >
> > > -----Original Message-----
> > > From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> > > Sent: Thursday, October 31, 2019 3:51 AM
> > > To: dev@kafka.apache.org
> > > Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> > >
> > > Hi Matthias
> > >
> > > Thanks for the response.
> > >
> > > (1) Yes
> > >
> > > (2) Yes, and the config name will be the same (i.e.
> > > `log.cleaner.compaction.strategy` &
> > > `log.cleaner.compaction.strategy.header`) at broker level and 
> > > topic level (to override broker level default compact strategy). 
> > > Please let me know if we need to keep it in different naming convention. Note:
> > > Broker level (which will be in the server.properties) 
> > > configuration is optional and default it to offset. Topic level 
> > > configuration will be default to broker level config...
> > >
> > > (3) By this new way, it avoids another config parameter and also 
> > > in feature if any new strategy like header need addition info, no 
> > > additional config required. As this got discussed already and 
> > > agreed to have separate config, I will revert it. KIP updated...
> > >
> > > (4) Done
> > >
> > > (5) Updated
> > >
> > > (6) Updated to pick the first header in the list
> > >
> > > Please let me know if you have any other questions.
> > >
> > > Thanks,
> > > Senthil
> > >
> > > -----Original Message-----
> > > From: Matthias J. Sax <ma...@confluent.io>
> > > Sent: Thursday, October 31, 2019 12:13 AM
> > > To: dev@kafka.apache.org
> > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > >
> > > Thanks for picking up this KIP, Senthil.
> > >
> > > (1) As far as I remember, the main issue of the original proposal 
> > > was a missing topic level configuration for the compaction strategy.
> > > With this being addressed, I am in favor of this KIP.
> > >
> > > (2) With regard to (1), it seems we would need a new topic level 
> > > config `compaction.strategy`, and 
> > > `log.cleaner.compaction.strategy` would be the default strategy 
> > > (ie, broker level config) if a topic does
> > not overwrite it?
> > >
> > > (3) Why did you remove `log.cleaner.compaction.strategy.header`
> > > parameter and change the accepted values of 
> > > `log.cleaner.compaction.strategy` to "header.<key>" instead of 
> > > keeping "header"? The original approach seems to be cleaner, and I 
> > > think this was discussed on the original discuss thread already.
> > >
> > > (4) Nit: For the "timestamp" compaction strategy you changed the 
> > > KIP to
> > >
> > > -> `The record [create] timestamp`
> > >
> > > This is misleading IMHO, because it depends on the broker/log
> > > configuration `(log.)message.timestamp.type` that can either be 
> > > `CreateTime` or `LogAppendTime` what the actual record timestamp is.
> > > I would just remove "create" to keep it unspecified.
> > >
> > > (5) Nit: the section "Public Interfaces" should list the newly 
> > > introduced configs -- configuration parameters are a public interface.
> > >
> > > (6) What do you mean by "first level header lookup"? The term 
> > > "first level" indicates some hierarchy, but headers don't have any 
> > > hierarchy
> > > -- it's just a list of key-value pairs? If you mean the _order_ of 
> > > the headers, ie, pick the first header in the list that matches 
> > > the key, please rephrase it to make it clearer.
> > >
> > >
> > >
> > > @Tom: I agree with all you are saying, however, I still think that 
> > > this KIP will improve the overall situation, because everything 
> > > you pointed out is actually true with offset based compaction, too.
> > >
> > > The KIP is not a silver bullet that solves all issue for 
> > > interleaved writes, but I personally believe, it's a good improvement.
> > >
> > >
> > >
> > > -Matthias
> > >
> > >
> > > On 10/30/19 9:45 AM, Senthilnathan Muthusamy wrote:
> > > > Hi,
> > > >
> > > > Please let me know if anyone has any questions on this updated
> > KIP-280...
> > > >
> > > > Thanks,
> > > >
> > > > Senthil
> > > >
> > > > -----Original Message-----
> > > > From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> > > > Sent: Monday, October 28, 2019 11:36 PM
> > > > To: dev@kafka.apache.org
> > > > Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> > > >
> > > > Hi Tom,
> > > >
> > > > Sorry for the delayed response.
> > > >
> > > > Regarding the fall back to offset decision for both timestamp & 
> > > > header
> > > value is based on the previous author discuss 
> > > https://lists.apache.org/thread.html/f44317eb6cd34f91966654c80509d4a457dbbccdd02b86645782be67@%3Cdev.kafka.apache.org%3E
> > > and as per the discussion, it is really required to avoid duplicates.
> > > >
> > > > And the timestamp strategy is from the original KIP author and 
> > > > we are
> > > keeping it as is.
> > > >
> > > > Finally on the sequence order guarantee by the producer, it is 
> > > > not
> > > feasible on waiting for ack in async / multi-threads/processes 
> > > scenarios and hence the header sequence based compact strategy 
> > > with producer's responsibility to have a unique sequence 
> > > generation for the topic-partition-key level.
> > > >
> > > > Hoping this clarifies all your questions. Please let us know if 
> > > > you have
> > > any further questions.
> > > >
> > > > @Guozhang Wang / @Matthias J. Sax, I see you both had a detail
> > > discussion on the original KIP with previous author and it would 
> > > great to hear your inputs as well.
> > > >
> > > > Thanks,
> > > > Senthil
> > > >
> > > > -----Original Message-----
> > > > From: Tom Bentley <tb...@redhat.com>
> > > > Sent: Tuesday, October 22, 2019 2:32 AM
> > > > To: dev@kafka.apache.org
> > > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > > >
> > > > Hi Senthilnathan,
> > > >
> > > > In the motivation isn't it a little misleading to say "On the 
> > > > producer side, we clearly preserve an order for the two 
> > > > messages, <K1, V1> <K1,
> > > > V2>"? IMHO, the semantics of the producer are clear that having 
> > > > an observed
> > > > order of sending records from different producers is not 
> > > > sufficient to
> > > guarantee ordering on the broker. You really need to send the 2nd 
> > > record only after the 1st record is acked. It's the difficultly of 
> > > achieving that in practice that's the true motivation for your KIP.
> > > >
> > > > I can see the attraction of using timestamps, but it would be 
> > > > helpful to
> > > explain how that really solves the problem. When the producers are 
> > > in different processes on different machines you're relying on 
> > > their clocks being synchronized, which is a whole subject in 
> > > itself. Even if they're synchronized the resolution of 
> > > System.currentTimeMillis() is typically many milliseconds. If your 
> > > producers are in different threads of the same process that could 
> > > be a real problem because it
> > makes ties quite likely.
> > > > And you don't explain why it's OK to resolve ties using the offset.
> > > > The
> > > basis of your argument is that the offset is giving you the wrong
> answer.
> > > > So it seems to me that using it as a tiebreaker is just 
> > > > narrowing the
> > > chances of getting the wrong answer. Maybe none of this matters 
> > > for your use case, but I think it should be spelled out in the 
> > > KIP, because it surely would matter for similar use cases.
> > > >
> > > > Using a sequence at least removes the problem of ties, but the
> > > interesting bit is now in how you deal with races between 
> > > threads/processes in getting a sequence number allocated (which is 
> > > out of scope of the KIP, I guess).
> > > > How is resolving that race any simpler that resolving the 
> > > > motivating
> > > race by waiting for the ack of the first record sent?
> > > >
> > > > Kind regards,
> > > >
> > > > Tom
> > > >
> > > > On Mon, Oct 21, 2019 at 9:06 PM Senthilnathan Muthusamy <
> > > senthilm@microsoft.com.invalid> wrote:
> > > >
> > > >> Hi All,
> > > >>
> > > >> We are bring back the KIP-280 to live with small correct for 
> > > >> the discussion & voting. Thanks to previous author Luis Cabral 
> > > >> on the
> > > >> KIP-280 initiation and we are taking over to complete and get 
> > > >> it into
> > > 2.4...
> > > >>
> > > >> Below is the correction that we made to the existing KIP-280:
> > > >>
> > > >>   *   Allowing the compact strategy configuration at the topic level
> > as
> > > >> the log compaction is at the topic level and a broker can have 
> > > >> multiple topics. This allows the flexibility to have the 
> > > >> strategy at both broker level (i.e. for all topics within the 
> > > >> broker) and topic level (i.e. for a subset of topics within a broker) as well...
> > > >>
> > > >> KIP-280:
> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction
> > > >> PULL REQUEST:
> > > >> https://github.com/apache/kafka/pull/7528 (unit test coverage in
> > > >> progress)
> > > >>
> > > >> Previous Thread DISCUSS:
> > > >> https://lists.apache.org/thread.html/79aa6e50d7c737ddf83455dd8063692a535a1afa558620fe1a1496d3@%3Cdev.kafka.apache.org%3E
> > > >> Previous Thread VOTE:
> > > >> https://lists.apache.org/thread.html/b2ecd73ce849741f0c40b4f801c3f7650583497812713e240e1ac2b7@%3Cdev.kafka.apache.org%3E
> > > >>
> > > >> Appreciate your timely action.
> > > >>
> > > >> PS: Initiating a separate thread as I was not able to reply to 
> > > >> the existing threads...
> > > >>
> > > >> Thanks,
> > > >> Senthil
> > > >>
> > >
> > >
> >
> > --
> > -- Guozhang
> >
>


--
-- Guozhang

Re: [DISCUSS] KIP-280: Enhanced log compaction

Posted by Guozhang Wang <wa...@gmail.com>.
Hello Senthil,

Let me try to re-iterate on Jun's comments with some context here:

51: today with the offset-only compaction strategy, the last record of the
log (we call it the log-end-record, whose offset is log-end-offset) would
always be preserved and not compacted. This is kinda important for
replication since followers reason about the log-end-offset on the leader.
Consider this case: three replicas of a partition, leader 1 and followers 2
and 3.

Leader 1 has records a, b, c, d; d is the current last record of the
partition and the current log-end-offset is 3 (assuming record a's offset is
0).
Follower 2 has replicated a, b, c, d. Log-end-offset is 3.
Follower 3 has replicated a, b, c but not yet replicated d. Log-end-offset
is 2.

NOTE that compaction is triggered independently on each broker, so it is
possible that leader 1 triggers compaction and deletes record d while the
other followers have not triggered compaction yet. At this moment the
leader's log becomes a, b, c. Now let's say follower 3 fetches from the
leader after the compaction; it will no longer see record d.

Now suppose there's a leader migration and follower 3 becomes the new
leader. It would accept new appends (say, a record e), and record e would be
appended at *offset 3* on new leader 3's log. But on follower 2 the record at
offset 3 is still d. Later let's say follower 2 also triggers compaction and
also fetches the new record e from new leader 3:

Follower 2's log would be *a(0), b(1), c(2), e(4)*, where the numbers in
brackets are offsets, while leader 3's log would be *a(0), b(1),
c(2), e(3)*. Now you see the two logs diverge in offsets, although their
log entries are the same.
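
To make the offset divergence concrete, here is a tiny, self-contained sketch
(purely illustrative; it does not use the real replication or LogCleaner code)
that replays the scenario above, with a log modelled as a plain list of
(offset, key, value) entries:

    object OffsetDivergenceSketch extends App {
      final case class Entry(offset: Long, key: String, value: String)

      // Leader 1 before compaction: a, b, c, d (log-end-offset = 3)
      val initial = List(Entry(0, "k1", "a"), Entry(1, "k2", "b"),
                         Entry(2, "k3", "c"), Entry(3, "k1", "d"))
      var leader1   = initial
      var follower2 = initial          // fully caught up
      var follower3 = initial.take(3)  // has not replicated d yet

      // Leader 1 compacts; under a non-offset strategy the last record d can lose
      leader1 = leader1.filterNot(_.offset == 3L)

      // Follower 3 becomes the new leader and appends e at its log end, offset 3
      follower3 = follower3 :+ Entry(3, "k4", "e")

      // Follower 2 still holds d at offset 3, so the fetched e lands at offset 4;
      // a later compaction then removes d
      follower2 = (follower2 :+ Entry(4, "k4", "e")).filterNot(_.offset == 3L)

      println(s"new leader 3: $follower3") // ends with Entry(3,k4,e)
      println(s"follower 2:   $follower2") // ends with Entry(4,k4,e) -> offsets diverge
    }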

-------------------------------------

One way to resolve this is to simply never remove the last message during
compaction. Another way (suggested by Jason in the old VOTE thread) is to
create an empty message batch to "take up" that offset slot.


53: Again here's some context on when we can delete a tombstone (a record
with a null payload): during compaction, if we see that the latest record for
a certain key is a tombstone, we can remove all older records, BUT that
tombstone itself cannot be removed immediately, since the older records may
already have been fetched by some consumers while the tombstone has not been
fetched by those consumers yet. Also, the tombstone may not have been
replicated to all the followers yet while the older records have already been
replicated. Hence we have a config on the broker to "delay" the removal of
the tombstone itself. You can find this config named "delete.retention.ms" in
https://kafka.apache.org/documentation/#brokerconfigs

Now consider the timestamp / header based compaction strategies: a later
record may still be superseded by an earlier tombstone, so if that tombstone
has already been removed then the log compaction thread would not remove that
later record, and hence the logic would be broken. That's why we also need to
consider "delaying" the removal of the tombstone in this case.

Personally I think we can still piggy-back on "delete.retention.ms",
since its default value is 86400000 ms == 1 day, and we just need to
document that if you have timestamp / header based compaction, then it's
YOUR responsibility as the Kafka user to make sure that the timestamp /
header out-of-order window is smaller than the value of "delete.retention.ms".
Otherwise some later records with smaller timestamps / headers may not be
compacted correctly, since the tombstone is already gone and hence we do not
have the "proof" to remove them anymore.


Does that make sense to you?

Guozhang


On Tue, Nov 12, 2019 at 9:15 AM Senthilnathan Muthusamy
<se...@microsoft.com.invalid> wrote:

> Hi Jun,
>
> Thanks for the response and please find below the response!
>
> #50 - got it...
>
> #51 - not sure how the last record will be deleted bcoz of this new
> compact strategy. The reason I am asking is, the compaction is based out of
> offsetmap and the new strategy logic is purely within the offsetmap... the
> offsetmap will always keep track of the latest offset irrespective of the
> compaction strategy. You can have a look at the PR of the new compaction
> strategy changes: https://github.com/apache/kafka/pull/7528/files
>
> #52 - sure, I have updated JIRA to include this details in the wiki.
>
> #53 - as I am pointed out in #51, the tombstone is abstract to this change
> (i.e. the tombstone is handled within LogCleaner and the compact strategy
> is by the offsetmap). this is what my understand on the tombstone based on
> the code walk-thru... please let me know if I am missing anything here...
>
> Thanks,
> Senthil
>
> -----Original Message-----
> From: Jun Rao <ju...@confluent.io>
> Sent: Thursday, November 7, 2019 4:32 PM
> To: dev <de...@kafka.apache.org>
> Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
>
> Hi, Senthil,
>
> Thanks for bringing back this KIP. Overall, this seems like a useful
> feature. A few comments below.
>
> 50. One use case for the timestamp based compaction is to resolve
> conflicts during data center failures. The failover of a data center
> typically happens much longer tha millisec. So, timestamp could be enough
> to determine the value to keep.
>
> 51. With the timestamp/header strategy, it seems that it may now be
> possible that the last record could be removed during compaction. For
> example, if the active segment is empty, the last record in the previous
> segment could be removed due to compaction. A new replica then won't see
> the true end offset of the partition. If that replica ever becomes the
> leader, it could write a different record on the same end offset, which
> will be weird.
>
> 52. With the timestamp/header strategy, the behavior of the application
> may need to change. In particular, the application can't just blindly take
> the record with a larger offset and assuming that it's the value to keep.
> It needs to check the timestamp or the header now. So, it would be useful
> to at least document this.
>
> 53. This also adds complexity for deletes. Currently, we use a null
> payload to indicate a delete tombstone. The tombstone can be removed once
> all previous records with the same key have been removed. If the new
> strategies apply to tombstones, it's not clear when a tombstone can be
> removed since subsequent records could have timestamp/sequenceId smaller
> than that in the tombstone. It would be useful to think this through and
> document the expected behavior.
>
> Jun
>
> On Tue, Nov 5, 2019 at 11:37 AM Senthilnathan Muthusamy <
> senthilm@microsoft.com.invalid> wrote:
>
> > Hi Guozhang,
> >
> > Sure and I have made a note in the JIRA item to make sure the wiki is
> > updated.
> >
> > Thanks,
> > Senthil
> >
> > -----Original Message-----
> > From: Guozhang Wang <wa...@gmail.com>
> > Sent: Monday, November 4, 2019 11:00 AM
> > To: dev <de...@kafka.apache.org>
> > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> >
> > Hello Senthilnathan,
> >
> > Thanks for revamping on the KIP. I have only one comment about the
> > wiki otherwise LGTM.
> >
> > 1. We should emphasize that the newly introduced config yields to the
> > existing "log.cleanup.policy", i.e. if the latter's value is `delete`
> > not `compact`, then the previous config would be ignored.
> >
> >
> > Guozhang
> >
> > On Mon, Nov 4, 2019 at 9:52 AM Senthilnathan Muthusamy <
> > senthilm@microsoft.com.invalid> wrote:
> >
> > > Hi all,
> > >
> > > I will start the vote thread shortly for this updated KIP. If there
> > > are any more thoughts I would love to hear them.
> > >
> > > Thanks,
> > > Senthil
> > >
> > > -----Original Message-----
> > > From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> > > Sent: Thursday, October 31, 2019 3:51 AM
> > > To: dev@kafka.apache.org
> > > Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> > >
> > > Hi Matthias
> > >
> > > Thanks for the response.
> > >
> > > (1) Yes
> > >
> > > (2) Yes, and the config name will be the same (i.e.
> > > `log.cleaner.compaction.strategy` &
> > > `log.cleaner.compaction.strategy.header`) at broker level and topic
> > > level (to override broker level default compact strategy). Please
> > > let me know if we need to keep it in different naming convention. Note:
> > > Broker level (which will be in the server.properties) configuration
> > > is optional and default it to offset. Topic level configuration will
> > > be default to broker level config...
> > >
> > > (3) By this new way, it avoids another config parameter and also in
> > > feature if any new strategy like header need addition info, no
> > > additional config required. As this got discussed already and agreed
> > > to have separate config, I will revert it. KIP updated...
> > >
> > > (4) Done
> > >
> > > (5) Updated
> > >
> > > (6) Updated to pick the first header in the list
> > >
> > > Please let me know if you have any other questions.
> > >
> > > Thanks,
> > > Senthil
> > >
> > > -----Original Message-----
> > > From: Matthias J. Sax <ma...@confluent.io>
> > > Sent: Thursday, October 31, 2019 12:13 AM
> > > To: dev@kafka.apache.org
> > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > >
> > > Thanks for picking up this KIP, Senthil.
> > >
> > > (1) As far as I remember, the main issue of the original proposal
> > > was a missing topic level configuration for the compaction strategy.
> > > With this being addressed, I am in favor of this KIP.
> > >
> > > (2) With regard to (1), it seems we would need a new topic level
> > > config `compaction.strategy`, and `log.cleaner.compaction.strategy`
> > > would be the default strategy (ie, broker level config) if a topic
> > > does
> > not overwrite it?
> > >
> > > (3) Why did you remove `log.cleaner.compaction.strategy.header`
> > > parameter and change the accepted values of
> > > `log.cleaner.compaction.strategy` to "header.<key>" instead of
> > > keeping "header"? The original approach seems to be cleaner, and I
> > > think this was discussed on the original discuss thread already.
> > >
> > > (4) Nit: For the "timestamp" compaction strategy you changed the KIP
> > > to
> > >
> > > -> `The record [create] timestamp`
> > >
> > > This is miss leading IMHO, because it depends on the broker/log
> > > configuration `(log.)message.timestamp.type` that can either be
> > > `CreateTime` or `LogAppendTime` what the actual record timestamp is.
> > > I would just remove "create" to keep it unspecified.
> > >
> > > (5) Nit: the section "Public Interfaces" should list the newly
> > > introduced configs -- configuration parameters are a public interface.
> > >
> > > (6) What do you mean by "first level header lookup"? The term "first
> > > level" indicates some hierarchy, but headers don't have any
> > > hierarchy
> > > -- it's just a list of key-value pairs? If you mean the _order_ of
> > > the headers, ie, pick the first header in the list that matches the
> > > key, please rephrase it to make it clearer.
> > >
> > >
> > >
> > > @Tom: I agree with all you are saying, however, I still think that
> > > this KIP will improve the overall situation, because everything you
> > > pointed out is actually true with offset based compaction, too.
> > >
> > > The KIP is not a silver bullet that solves all issue for interleaved
> > > writes, but I personally believe, it's a good improvement.
> > >
> > >
> > >
> > > -Matthias
> > >
> > >
> > > On 10/30/19 9:45 AM, Senthilnathan Muthusamy wrote:
> > > > Hi,
> > > >
> > > > Please let me know if anyone has any questions on this updated
> > KIP-280...
> > > >
> > > > Thanks,
> > > >
> > > > Senthil
> > > >
> > > > -----Original Message-----
> > > > From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> > > > Sent: Monday, October 28, 2019 11:36 PM
> > > > To: dev@kafka.apache.org
> > > > Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> > > >
> > > > Hi Tom,
> > > >
> > > > Sorry for the delayed response.
> > > >
> > > > Regarding the fall back to offset decision for both timestamp &
> > > > header
> > > value is based on the previous author discuss
> > > https://lists.apache.org/thread.html/f44317eb6cd34f91966654c80509d4a457dbbccdd02b86645782be67@%3Cdev.kafka.apache.org%3E
> > > and as per the discussion, it is really required to avoid duplicates.
> > > >
> > > > And the timestamp strategy is from the original KIP author and we
> > > > are
> > > keeping it as is.
> > > >
> > > > Finally on the sequence order guarantee by the producer, it is not
> > > feasible on waiting for ack in async / multi-threads/processes
> > > scenarios and hence the header sequence based compact strategy with
> > > producer's responsibility to have a unique sequence generation for
> > > the topic-partition-key level.
> > > >
> > > > Hoping this clarifies all your questions. Please let us know if
> > > > you have
> > > any further questions.
> > > >
> > > > @Guozhang Wang / @Matthias J. Sax, I see you both had a detail
> > > discussion on the original KIP with previous author and it would
> > > great to hear your inputs as well.
> > > >
> > > > Thanks,
> > > > Senthil
> > > >
> > > > -----Original Message-----
> > > > From: Tom Bentley <tb...@redhat.com>
> > > > Sent: Tuesday, October 22, 2019 2:32 AM
> > > > To: dev@kafka.apache.org
> > > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > > >
> > > > Hi Senthilnathan,
> > > >
> > > > In the motivation isn't it a little misleading to say "On the
> > > > producer side, we clearly preserve an order for the two messages,
> > > > <K1, V1> <K1,
> > > > V2>"? IMHO, the semantics of the producer are clear that having an
> > > > observed
> > > > order of sending records from different producers is not
> > > > sufficient to
> > > guarantee ordering on the broker. You really need to send the 2nd
> > > record only after the 1st record is acked. It's the difficultly of
> > > achieving that in practice that's the true motivation for your KIP.
> > > >
> > > > I can see the attraction of using timestamps, but it would be
> > > > helpful to
> > > explain how that really solves the problem. When the producers are
> > > in different processes on different machines you're relying on their
> > > clocks being synchronized, which is a whole subject in itself. Even
> > > if they're synchronized the resolution of System.currentTimeMillis()
> > > is typically many milliseconds. If your producers are in different
> > > threads of the same process that could be a real problem because it
> > makes ties quite likely.
> > > > And you don't explain why it's OK to resolve ties using the offset.
> > > > The
> > > basis of your argument is that the offset is giving you the wrong
> answer.
> > > > So it seems to me that using it as a tiebreaker is just narrowing
> > > > the
> > > chances of getting the wrong answer. Maybe none of this matters for
> > > your use case, but I think it should be spelled out in the KIP,
> > > because it surely would matter for similar use cases.
> > > >
> > > > Using a sequence at least removes the problem of ties, but the
> > > interesting bit is now in how you deal with races between
> > > threads/processes in getting a sequence number allocated (which is
> > > out of scope of the KIP, I guess).
> > > > How is resolving that race any simpler that resolving the
> > > > motivating
> > > race by waiting for the ack of the first record sent?
> > > >
> > > > Kind regards,
> > > >
> > > > Tom
> > > >
> > > > On Mon, Oct 21, 2019 at 9:06 PM Senthilnathan Muthusamy <
> > > senthilm@microsoft.com.invalid> wrote:
> > > >
> > > >> Hi All,
> > > >>
> > > >> We are bring back the KIP-280 to live with small correct for the
> > > >> discussion & voting. Thanks to previous author Luis Cabral on the
> > > >> KIP-280 initiation and we are taking over to complete and get it
> > > >> into
> > > 2.4...
> > > >>
> > > >> Below is the correction that we made to the existing KIP-280:
> > > >>
> > > >>   *   Allowing the compact strategy configuration at the topic level
> > as
> > > >> the log compaction is at the topic level and a broker can have
> > > >> multiple topics. This allows the flexibility to have the strategy
> > > >> at both broker level (i.e. for all topics within the broker) and
> > > >> topic level (i.e. for a subset of topics within a broker) as well...
> > > >>
> > > >> KIP-280:
> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction
> > > >> PULL REQUEST:
> > > >> https://github.com/apache/kafka/pull/7528 (unit test coverage in
> > > >> progress)
> > > >>
> > > >> Previous Thread DISCUSS:
> > > >> https://lists.apache.org/thread.html/79aa6e50d7c737ddf83455dd8063692a535a1afa558620fe1a1496d3@%3Cdev.kafka.apache.org%3E
> > > >> Previous Thread VOTE:
> > > >> https://lists.apache.org/thread.html/b2ecd73ce849741f0c40b4f801c3f7650583497812713e240e1ac2b7@%3Cdev.kafka.apache.org%3E
> > > >>
> > > >> Appreciate your timely action.
> > > >>
> > > >> PS: Initiating a separate thread as I was not able to reply to
> > > >> the existing threads...
> > > >>
> > > >> Thanks,
> > > >> Senthil
> > > >>
> > >
> > >
> >
> > --
> > -- Guozhang
> >
>


-- 
-- Guozhang

Re: [DISCUSS] KIP-280: Enhanced log compaction

Posted by Jun Rao <ju...@confluent.io>.
Hi, Senthil,

51. The difference is that with the offset compaction strategy, the message
corresponding to the last offset is always the winning record and will
never be removed. But with the new strategies, it's now possible that the
message corresponding to the last offset is a losing record and needs to be
removed.

53. Similarly, with the offset compaction strategy, if we see a
non-tombstone record after a tombstone record, the non-tombstone record is
always the winning one. However, with the new strategies, that
non-tombstone record with a larger offset could be a losing record. The
question is then how do we retain the tombstone long enough so that we
could still recognize that the non-tombstone record should be ignored.
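
As a tiny worked example (hypothetical header sequence values), the record
holding the last offset can now be the loser:

    // Records for key K1, in log order:
    //   offset 10 -> seq 7
    //   offset 11 -> seq 3   (last offset)
    val records = List((10L, 7L), (11L, 3L))   // (offset, sequence)
    val winner  = records.maxBy(_._2)          // (10, 7) wins on sequence
    // offset 11 holds the last offset yet is a losing record, so compaction
    // would want to remove it -- which is exactly the concern in 51.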

Thanks,

Jun

On Mon, Nov 11, 2019 at 5:15 PM Senthilnathan Muthusamy
<se...@microsoft.com.invalid> wrote:

> Hi Jun,
>
> Thanks for the response and please find below the response!
>
> #50 - got it...
>
> #51 - not sure how the last record will be deleted bcoz of this new
> compact strategy. The reason I am asking is, the compaction is based out of
> offsetmap and the new strategy logic is purely within the offsetmap... the
> offsetmap will always keep track of the latest offset irrespective of the
> compaction strategy. You can have a look at the PR of the new compaction
> strategy changes: https://github.com/apache/kafka/pull/7528/files
>
> #52 - sure, I have updated JIRA to include this details in the wiki.
>
> #53 - as I am pointed out in #51, the tombstone is abstract to this change
> (i.e. the tombstone is handled within LogCleaner and the compact strategy
> is by the offsetmap). this is what my understand on the tombstone based on
> the code walk-thru... please let me know if I am missing anything here...
>
> Thanks,
> Senthil
>
> -----Original Message-----
> From: Jun Rao <ju...@confluent.io>
> Sent: Thursday, November 7, 2019 4:32 PM
> To: dev <de...@kafka.apache.org>
> Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
>
> Hi, Senthil,
>
> Thanks for bringing back this KIP. Overall, this seems like a useful
> feature. A few comments below.
>
> 50. One use case for the timestamp based compaction is to resolve
> conflicts during data center failures. The failover of a data center
> typically happens much longer tha millisec. So, timestamp could be enough
> to determine the value to keep.
>
> 51. With the timestamp/header strategy, it seems that it may now be
> possible that the last record could be removed during compaction. For
> example, if the active segment is empty, the last record in the previous
> segment could be removed due to compaction. A new replica then won't see
> the true end offset of the partition. If that replica ever becomes the
> leader, it could write a different record on the same end offset, which
> will be weird.
>
> 52. With the timestamp/header strategy, the behavior of the application
> may need to change. In particular, the application can't just blindly take
> the record with a larger offset and assuming that it's the value to keep.
> It needs to check the timestamp or the header now. So, it would be useful
> to at least document this.
>
> 53. This also adds complexity for deletes. Currently, we use a null
> payload to indicate a delete tombstone. The tombstone can be removed once
> all previous records with the same key have been removed. If the new
> strategies apply to tombstones, it's not clear when a tombstone can be
> removed since subsequent records could have timestamp/sequenceId smaller
> than that in the tombstone. It would be useful to think this through and
> document the expected behavior.
>
> Jun
>
> On Tue, Nov 5, 2019 at 11:37 AM Senthilnathan Muthusamy <
> senthilm@microsoft.com.invalid> wrote:
>
> > Hi Guozhang,
> >
> > Sure and I have made a note in the JIRA item to make sure the wiki is
> > updated.
> >
> > Thanks,
> > Senthil
> >
> > -----Original Message-----
> > From: Guozhang Wang <wa...@gmail.com>
> > Sent: Monday, November 4, 2019 11:00 AM
> > To: dev <de...@kafka.apache.org>
> > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> >
> > Hello Senthilnathan,
> >
> > Thanks for revamping on the KIP. I have only one comment about the
> > wiki otherwise LGTM.
> >
> > 1. We should emphasize that the newly introduced config yields to the
> > existing "log.cleanup.policy", i.e. if the latter's value is `delete`
> > not `compact`, then the previous config would be ignored.
> >
> >
> > Guozhang
> >
> > On Mon, Nov 4, 2019 at 9:52 AM Senthilnathan Muthusamy <
> > senthilm@microsoft.com.invalid> wrote:
> >
> > > Hi all,
> > >
> > > I will start the vote thread shortly for this updated KIP. If there
> > > are any more thoughts I would love to hear them.
> > >
> > > Thanks,
> > > Senthil
> > >
> > > -----Original Message-----
> > > From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> > > Sent: Thursday, October 31, 2019 3:51 AM
> > > To: dev@kafka.apache.org
> > > Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> > >
> > > Hi Matthias
> > >
> > > Thanks for the response.
> > >
> > > (1) Yes
> > >
> > > (2) Yes, and the config name will be the same (i.e.
> > > `log.cleaner.compaction.strategy` &
> > > `log.cleaner.compaction.strategy.header`) at broker level and topic
> > > level (to override broker level default compact strategy). Please
> > > let me know if we need to keep it in different naming convention. Note:
> > > Broker level (which will be in the server.properties) configuration
> > > is optional and default it to offset. Topic level configuration will
> > > be default to broker level config...
> > >
> > > (3) By this new way, it avoids another config parameter and also in
> > > feature if any new strategy like header need addition info, no
> > > additional config required. As this got discussed already and agreed
> > > to have separate config, I will revert it. KIP updated...
> > >
> > > (4) Done
> > >
> > > (5) Updated
> > >
> > > (6) Updated to pick the first header in the list
> > >
> > > Please let me know if you have any other questions.
> > >
> > > Thanks,
> > > Senthil
> > >
> > > -----Original Message-----
> > > From: Matthias J. Sax <ma...@confluent.io>
> > > Sent: Thursday, October 31, 2019 12:13 AM
> > > To: dev@kafka.apache.org
> > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > >
> > > Thanks for picking up this KIP, Senthil.
> > >
> > > (1) As far as I remember, the main issue of the original proposal
> > > was a missing topic level configuration for the compaction strategy.
> > > With this being addressed, I am in favor of this KIP.
> > >
> > > (2) With regard to (1), it seems we would need a new topic level
> > > config `compaction.strategy`, and `log.cleaner.compaction.strategy`
> > > would be the default strategy (ie, broker level config) if a topic
> > > does
> > not overwrite it?
> > >
> > > (3) Why did you remove `log.cleaner.compaction.strategy.header`
> > > parameter and change the accepted values of
> > > `log.cleaner.compaction.strategy` to "header.<key>" instead of
> > > keeping "header"? The original approach seems to be cleaner, and I
> > > think this was discussed on the original discuss thread already.
> > >
> > > (4) Nit: For the "timestamp" compaction strategy you changed the KIP
> > > to
> > >
> > > -> `The record [create] timestamp`
> > >
> > > This is miss leading IMHO, because it depends on the broker/log
> > > configuration `(log.)message.timestamp.type` that can either be
> > > `CreateTime` or `LogAppendTime` what the actual record timestamp is.
> > > I would just remove "create" to keep it unspecified.
> > >
> > > (5) Nit: the section "Public Interfaces" should list the newly
> > > introduced configs -- configuration parameters are a public interface.
> > >
> > > (6) What do you mean by "first level header lookup"? The term "first
> > > level" indicates some hierarchy, but headers don't have any
> > > hierarchy
> > > -- it's just a list of key-value pairs? If you mean the _order_ of
> > > the headers, ie, pick the first header in the list that matches the
> > > key, please rephrase it to make it clearer.
> > >
> > >
> > >
> > > @Tom: I agree with all you are saying, however, I still think that
> > > this KIP will improve the overall situation, because everything you
> > > pointed out is actually true with offset based compaction, too.
> > >
> > > The KIP is not a silver bullet that solves all issue for interleaved
> > > writes, but I personally believe, it's a good improvement.
> > >
> > >
> > >
> > > -Matthias
> > >
> > >
> > > On 10/30/19 9:45 AM, Senthilnathan Muthusamy wrote:
> > > > Hi,
> > > >
> > > > Please let me know if anyone has any questions on this updated
> > KIP-280...
> > > >
> > > > Thanks,
> > > >
> > > > Senthil
> > > >
> > > > -----Original Message-----
> > > > From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> > > > Sent: Monday, October 28, 2019 11:36 PM
> > > > To: dev@kafka.apache.org
> > > > Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> > > >
> > > > Hi Tom,
> > > >
> > > > Sorry for the delayed response.
> > > >
> > > > Regarding the fall back to offset decision for both timestamp &
> > > > header
> > > value is based on the previous author discuss
> > > https://lists.apache.org/thread.html/f44317eb6cd34f91966654c80509d4a457dbbccdd02b86645782be67@%3Cdev.kafka.apache.org%3E
> > > and as per the discussion, it is really required to avoid duplicates.
> > > >
> > > > And the timestamp strategy is from the original KIP author and we
> > > > are
> > > keeping it as is.
> > > >
> > > > Finally on the sequence order guarantee by the producer, it is not
> > > feasible on waiting for ack in async / multi-threads/processes
> > > scenarios and hence the header sequence based compact strategy with
> > > producer's responsibility to have a unique sequence generation for
> > > the topic-partition-key level.
> > > >
> > > > Hoping this clarifies all your questions. Please let us know if
> > > > you have
> > > any further questions.
> > > >
> > > > @Guozhang Wang / @Matthias J. Sax, I see you both had a detail
> > > discussion on the original KIP with previous author and it would
> > > great to hear your inputs as well.
> > > >
> > > > Thanks,
> > > > Senthil
> > > >
> > > > -----Original Message-----
> > > > From: Tom Bentley <tb...@redhat.com>
> > > > Sent: Tuesday, October 22, 2019 2:32 AM
> > > > To: dev@kafka.apache.org
> > > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > > >
> > > > Hi Senthilnathan,
> > > >
> > > > In the motivation isn't it a little misleading to say "On the
> > > > producer side, we clearly preserve an order for the two messages,
> > > > <K1, V1> <K1,
> > > > V2>"? IMHO, the semantics of the producer are clear that having an
> > > > observed
> > > > order of sending records from different producers is not
> > > > sufficient to
> > > guarantee ordering on the broker. You really need to send the 2nd
> > > record only after the 1st record is acked. It's the difficultly of
> > > achieving that in practice that's the true motivation for your KIP.
> > > >
> > > > I can see the attraction of using timestamps, but it would be
> > > > helpful to
> > > explain how that really solves the problem. When the producers are
> > > in different processes on different machines you're relying on their
> > > clocks being synchronized, which is a whole subject in itself. Even
> > > if they're synchronized the resolution of System.currentTimeMillis()
> > > is typically many milliseconds. If your producers are in different
> > > threads of the same process that could be a real problem because it
> > makes ties quite likely.
> > > > And you don't explain why it's OK to resolve ties using the offset.
> > > > The
> > > basis of your argument is that the offset is giving you the wrong
> answer.
> > > > So it seems to me that using it as a tiebreaker is just narrowing
> > > > the
> > > chances of getting the wrong answer. Maybe none of this matters for
> > > your use case, but I think it should be spelled out in the KIP,
> > > because it surely would matter for similar use cases.
> > > >
> > > > Using a sequence at least removes the problem of ties, but the
> > > interesting bit is now in how you deal with races between
> > > threads/processes in getting a sequence number allocated (which is
> > > out of scope of the KIP, I guess).
> > > > How is resolving that race any simpler that resolving the
> > > > motivating
> > > race by waiting for the ack of the first record sent?
> > > >
> > > > Kind regards,
> > > >
> > > > Tom
> > > >
> > > > On Mon, Oct 21, 2019 at 9:06 PM Senthilnathan Muthusamy <
> > > senthilm@microsoft.com.invalid> wrote:
> > > >
> > > >> Hi All,
> > > >>
> > > >> We are bring back the KIP-280 to live with small correct for the
> > > >> discussion & voting. Thanks to previous author Luis Cabral on the
> > > >> KIP-280 initiation and we are taking over to complete and get it
> > > >> into
> > > 2.4...
> > > >>
> > > >> Below is the correction that we made to the existing KIP-280:
> > > >>
> > > >>   *   Allowing the compact strategy configuration at the topic level
> > as
> > > >> the log compaction is at the topic level and a broker can have
> > > >> multiple topics. This allows the flexibility to have the strategy
> > > >> at both broker level (i.e. for all topics within the broker) and
> > > >> topic level (i.e. for a subset of topics within a broker) as well...
> > > >>
> > > >> KIP-280:
> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction
> > > >> PULL REQUEST:
> > > >> https://github.com/apache/kafka/pull/7528 (unit test coverage in
> > > >> progress)
> > > >>
> > > >> Previous Thread DISCUSS:
> > > >> https://lists.apache.org/thread.html/79aa6e50d7c737ddf83455dd8063692a535a1afa558620fe1a1496d3@%3Cdev.kafka.apache.org%3E
> > > >> Previous Thread VOTE:
> > > >> https://lists.apache.org/thread.html/b2ecd73ce849741f0c40b4f801c3f7650583497812713e240e1ac2b7@%3Cdev.kafka.apache.org%3E
> > > >>
> > > >> Appreciate your timely action.
> > > >>
> > > >> PS: Initiating a separate thread as I was not able to reply to
> > > >> the existing threads...
> > > >>
> > > >> Thanks,
> > > >> Senthil
> > > >>
> > >
> > >
> >
> > --
> > -- Guozhang
> >
>

RE: [DISCUSS] KIP-280: Enhanced log compaction

Posted by Senthilnathan Muthusamy <se...@microsoft.com.INVALID>.
Hi Jun,

Thanks for the response and please find below the response!

#50 - got it...

#51 - not sure how the last record would be deleted because of this new compaction strategy. The reason I am asking is that the compaction is based on the offset map and the new strategy logic lives purely within the offset map... the offset map will always keep track of the latest offset irrespective of the compaction strategy. You can have a look at the PR with the new compaction strategy changes: https://github.com/apache/kafka/pull/7528/files
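
To make that concrete, here is a rough sketch of the idea (not the actual PR
code; names and types are simplified): the offset map keeps, per key, the
offset of the current winner plus the value it is compared on (offset,
timestamp, or a header-derived sequence), and a later put() only replaces the
entry when the new record wins under the configured strategy:

    import scala.collection.mutable

    final case class MapEntry(offset: Long, compareValue: Long)

    final class StrategyOffsetMap {
      private val map = mutable.Map.empty[String, MapEntry]

      def put(key: String, offset: Long, compareValue: Long): Unit =
        map.get(key) match {
          case Some(existing) if !wins(compareValue, offset, existing) =>
            () // the already-stored record stays the winner
          case _ =>
            map.put(key, MapEntry(offset, compareValue))
        }

      // larger compare value wins; ties fall back to the larger offset,
      // as discussed earlier in this thread
      private def wins(newValue: Long, newOffset: Long, existing: MapEntry): Boolean =
        newValue > existing.compareValue ||
          (newValue == existing.compareValue && newOffset > existing.offset)

      def winningOffset(key: String): Option[Long] = map.get(key).map(_.offset)
    }

With the plain offset strategy the caller simply passes the offset itself as
the compare value, which degenerates to today's "last offset wins" behavior.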

#52 - sure, I have updated the JIRA to include these details in the wiki.

#53 - as I pointed out in #51, tombstone handling is orthogonal to this change (i.e. the tombstone is handled within the LogCleaner and the compaction strategy is applied by the offset map). This is my understanding of the tombstone handling based on the code walk-through... please let me know if I am missing anything here...

Thanks,
Senthil

-----Original Message-----
From: Jun Rao <ju...@confluent.io> 
Sent: Thursday, November 7, 2019 4:32 PM
To: dev <de...@kafka.apache.org>
Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction

Hi, Senthil,

Thanks for bringing back this KIP. Overall, this seems like a useful feature. A few comments below.

50. One use case for the timestamp based compaction is to resolve conflicts during data center failures. The failover of a data center typically happens much longer tha millisec. So, timestamp could be enough to determine the value to keep.

51. With the timestamp/header strategy, it seems that it may now be possible that the last record could be removed during compaction. For example, if the active segment is empty, the last record in the previous segment could be removed due to compaction. A new replica then won't see the true end offset of the partition. If that replica ever becomes the leader, it could write a different record on the same end offset, which will be weird.

52. With the timestamp/header strategy, the behavior of the application may need to change. In particular, the application can't just blindly take the record with a larger offset and assuming that it's the value to keep. It needs to check the timestamp or the header now. So, it would be useful to at least document this.

53. This also adds complexity for deletes. Currently, we use a null payload to indicate a delete tombstone. The tombstone can be removed once all previous records with the same key have been removed. If the new strategies apply to tombstones, it's not clear when a tombstone can be removed since subsequent records could have timestamp/sequenceId smaller than that in the tombstone. It would be useful to think this through and document the expected behavior.

Jun

On Tue, Nov 5, 2019 at 11:37 AM Senthilnathan Muthusamy <se...@microsoft.com.invalid> wrote:

> Hi Guozhang,
>
> Sure and I have made a note in the JIRA item to make sure the wiki is 
> updated.
>
> Thanks,
> Senthil
>
> -----Original Message-----
> From: Guozhang Wang <wa...@gmail.com>
> Sent: Monday, November 4, 2019 11:00 AM
> To: dev <de...@kafka.apache.org>
> Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
>
> Hello Senthilnathan,
>
> Thanks for revamping on the KIP. I have only one comment about the 
> wiki otherwise LGTM.
>
> 1. We should emphasize that the newly introduced config yields to the 
> existing "log.cleanup.policy", i.e. if the latter's value is `delete` 
> not `compact`, then the previous config would be ignored.
>
>
> Guozhang
>
> On Mon, Nov 4, 2019 at 9:52 AM Senthilnathan Muthusamy < 
> senthilm@microsoft.com.invalid> wrote:
>
> > Hi all,
> >
> > I will start the vote thread shortly for this updated KIP. If there 
> > are any more thoughts I would love to hear them.
> >
> > Thanks,
> > Senthil
> >
> > -----Original Message-----
> > From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> > Sent: Thursday, October 31, 2019 3:51 AM
> > To: dev@kafka.apache.org
> > Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> >
> > Hi Matthias
> >
> > Thanks for the response.
> >
> > (1) Yes
> >
> > (2) Yes, and the config name will be the same (i.e.
> > `log.cleaner.compaction.strategy` &
> > `log.cleaner.compaction.strategy.header`) at broker level and topic 
> > level (to override broker level default compact strategy). Please 
> > let me know if we need to keep it in different naming convention. Note:
> > Broker level (which will be in the server.properties) configuration 
> > is optional and default it to offset. Topic level configuration will 
> > be default to broker level config...
> >
> > (3) By this new way, it avoids another config parameter and also in 
> > feature if any new strategy like header need addition info, no 
> > additional config required. As this got discussed already and agreed 
> > to have separate config, I will revert it. KIP updated...
> >
> > (4) Done
> >
> > (5) Updated
> >
> > (6) Updated to pick the first header in the list
> >
> > Please let me know if you have any other questions.
> >
> > Thanks,
> > Senthil
> >
> > -----Original Message-----
> > From: Matthias J. Sax <ma...@confluent.io>
> > Sent: Thursday, October 31, 2019 12:13 AM
> > To: dev@kafka.apache.org
> > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> >
> > Thanks for picking up this KIP, Senthil.
> >
> > (1) As far as I remember, the main issue of the original proposal 
> > was a missing topic level configuration for the compaction strategy. 
> > With this being addressed, I am in favor of this KIP.
> >
> > (2) With regard to (1), it seems we would need a new topic level 
> > config `compaction.strategy`, and `log.cleaner.compaction.strategy` 
> > would be the default strategy (ie, broker level config) if a topic 
> > does
> not overwrite it?
> >
> > (3) Why did you remove `log.cleaner.compaction.strategy.header`
> > parameter and change the accepted values of 
> > `log.cleaner.compaction.strategy` to "header.<key>" instead of 
> > keeping "header"? The original approach seems to be cleaner, and I 
> > think this was discussed on the original discuss thread already.
> >
> > (4) Nit: For the "timestamp" compaction strategy you changed the KIP 
> > to
> >
> > -> `The record [create] timestamp`
> >
> > This is miss leading IMHO, because it depends on the broker/log 
> > configuration `(log.)message.timestamp.type` that can either be 
> > `CreateTime` or `LogAppendTime` what the actual record timestamp is. 
> > I would just remove "create" to keep it unspecified.
> >
> > (5) Nit: the section "Public Interfaces" should list the newly 
> > introduced configs -- configuration parameters are a public interface.
> >
> > (6) What do you mean by "first level header lookup"? The term "first 
> > level" indicates some hierarchy, but headers don't have any 
> > hierarchy
> > -- it's just a list of key-value pairs? If you mean the _order_ of 
> > the headers, ie, pick the first header in the list that matches the 
> > key, please rephrase it to make it clearer.
> >
> >
> >
> > @Tom: I agree with all you are saying, however, I still think that 
> > this KIP will improve the overall situation, because everything you 
> > pointed out is actually true with offset based compaction, too.
> >
> > The KIP is not a silver bullet that solves all issue for interleaved 
> > writes, but I personally believe, it's a good improvement.
> >
> >
> >
> > -Matthias
> >
> >
> > On 10/30/19 9:45 AM, Senthilnathan Muthusamy wrote:
> > > Hi,
> > >
> > > Please let me know if anyone has any questions on this updated
> KIP-280...
> > >
> > > Thanks,
> > >
> > > Senthil
> > >
> > > -----Original Message-----
> > > From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> > > Sent: Monday, October 28, 2019 11:36 PM
> > > To: dev@kafka.apache.org
> > > Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> > >
> > > Hi Tom,
> > >
> > > Sorry for the delayed response.
> > >
> > > Regarding the fall back to offset decision for both timestamp & 
> > > header
> > value is based on the previous author discuss 
> > https://lists.apache.org/thread.html/f44317eb6cd34f91966654c80509d4a457dbbccdd02b86645782be67@%3Cdev.kafka.apache.org%3E
> > and as per the discussion, it is really required to avoid duplicates.
> > >
> > > And the timestamp strategy is from the original KIP author and we 
> > > are
> > keeping it as is.
> > >
> > > Finally on the sequence order guarantee by the producer, it is not
> > feasible on waiting for ack in async / multi-threads/processes 
> > scenarios and hence the header sequence based compact strategy with 
> > producer's responsibility to have a unique sequence generation for 
> > the topic-partition-key level.
> > >
> > > Hoping this clarifies all your questions. Please let us know if 
> > > you have
> > any further questions.
> > >
> > > @Guozhang Wang / @Matthias J. Sax, I see you both had a detail
> > discussion on the original KIP with previous author and it would 
> > great to hear your inputs as well.
> > >
> > > Thanks,
> > > Senthil
> > >
> > > -----Original Message-----
> > > From: Tom Bentley <tb...@redhat.com>
> > > Sent: Tuesday, October 22, 2019 2:32 AM
> > > To: dev@kafka.apache.org
> > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > >
> > > Hi Senthilnathan,
> > >
> > > In the motivation isn't it a little misleading to say "On the 
> > > producer side, we clearly preserve an order for the two messages, 
> > > <K1, V1> <K1,
> > > V2>"? IMHO, the semantics of the producer are clear that having an 
> > > observed
> > > order of sending records from different producers is not 
> > > sufficient to
> > guarantee ordering on the broker. You really need to send the 2nd 
> > record only after the 1st record is acked. It's the difficultly of 
> > achieving that in practice that's the true motivation for your KIP.
> > >
> > > I can see the attraction of using timestamps, but it would be 
> > > helpful to
> > explain how that really solves the problem. When the producers are 
> > in different processes on different machines you're relying on their 
> > clocks being synchronized, which is a whole subject in itself. Even 
> > if they're synchronized the resolution of System.currentTimeMillis() 
> > is typically many milliseconds. If your producers are in different 
> > threads of the same process that could be a real problem because it
> makes ties quite likely.
> > > And you don't explain why it's OK to resolve ties using the offset.
> > > The
> > basis of your argument is that the offset is giving you the wrong answer.
> > > So it seems to me that using it as a tiebreaker is just narrowing 
> > > the
> > chances of getting the wrong answer. Maybe none of this matters for 
> > your use case, but I think it should be spelled out in the KIP, 
> > because it surely would matter for similar use cases.
> > >
> > > Using a sequence at least removes the problem of ties, but the
> > interesting bit is now in how you deal with races between 
> > threads/processes in getting a sequence number allocated (which is 
> > out of scope of the KIP, I guess).
> > > How is resolving that race any simpler that resolving the 
> > > motivating
> > race by waiting for the ack of the first record sent?
> > >
> > > Kind regards,
> > >
> > > Tom
> > >
> > > On Mon, Oct 21, 2019 at 9:06 PM Senthilnathan Muthusamy <
> > senthilm@microsoft.com.invalid> wrote:
> > >
> > >> Hi All,
> > >>
> > >> We are bring back the KIP-280 to live with small correct for the 
> > >> discussion & voting. Thanks to previous author Luis Cabral on the
> > >> KIP-280 initiation and we are taking over to complete and get it 
> > >> into
> > 2.4...
> > >>
> > >> Below is the correction that we made to the existing KIP-280:
> > >>
> > >>   *   Allowing the compact strategy configuration at the topic level
> as
> > >> the log compaction is at the topic level and a broker can have 
> > >> multiple topics. This allows the flexibility to have the strategy 
> > >> at both broker level (i.e. for all topics within the broker) and 
> > >> topic level (i.e. for a subset of topics within a broker) as well...
> > >>
> > >> KIP-280:
> > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction
> > >> PULL REQUEST:
> > >> https://github.com/apache/kafka/pull/7528 (unit test coverage in
> > >> progress)
> > >>
> > >> Previous Thread DISCUSS:
> > >> https://lists.apache.org/thread.html/79aa6e50d7c737ddf83455dd8063692a535a1afa558620fe1a1496d3@%3Cdev.kafka.apache.org%3E
> > >> Previous Thread VOTE:
> > >> https://lists.apache.org/thread.html/b2ecd73ce849741f0c40b4f801c3f7650583497812713e240e1ac2b7@%3Cdev.kafka.apache.org%3E
> > >>
> > >> Appreciate your timely action.
> > >>
> > >> PS: Initiating a separate thread as I was not able to reply to 
> > >> the existing threads...
> > >>
> > >> Thanks,
> > >> Senthil
> > >>
> >
> >
>
> --
> -- Guozhang
>

Re: [DISCUSS] KIP-280: Enhanced log compaction

Posted by Jun Rao <ju...@confluent.io>.
Hi, Senthil,

Thanks for bringing back this KIP. Overall, this seems like a useful
feature. A few comments below.

50. One use case for the timestamp based compaction is to resolve conflicts
during data center failures. The failover of a data center typically takes
much longer than a millisecond, so the timestamp could be enough to
determine the value to keep.

51. With the timestamp/header strategy, it seems that it may now be
possible that the last record could be removed during compaction. For
example, if the active segment is empty, the last record in the previous
segment could be removed due to compaction. A new replica then won't see
the true end offset of the partition. If that replica ever becomes the
leader, it could write a different record on the same end offset, which
will be weird.

52. With the timestamp/header strategy, the behavior of the application may
need to change. In particular, the application can't just blindly take the
record with the larger offset and assume that it's the value to keep. It
needs to check the timestamp or the header now. So, it would be useful to
at least document this.

53. This also adds complexity for deletes. Currently, we use a null payload
to indicate a delete tombstone. The tombstone can be removed once all
previous records with the same key have been removed. If the new strategies
apply to tombstones, it's not clear when a tombstone can be removed since
subsequent records could have timestamp/sequenceId smaller than that in the
tombstone. It would be useful to think this through and document the
expected behavior.

Jun
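
As a concrete illustration of point 52, below is a minimal sketch (not part of the KIP or the pull request) of how a consumer that materializes a table might pick the value to keep once compaction is driven by a header rather than by the offset. The header key "sequence", the assumption that the producer sets it once per record, and the 8-byte long encoding of its value are all assumptions made only for this example:

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.header.Header;

public final class LatestBySequence {
    // Latest value and sequence number seen per key, chosen by the header, not the offset.
    private final Map<String, byte[]> latestValue = new HashMap<>();
    private final Map<String, Long> latestSeq = new HashMap<>();

    public void apply(ConsumerRecord<String, byte[]> record) {
        Header h = record.headers().lastHeader("sequence"); // assumed header key
        if (h == null || h.value() == null || h.value().length != 8) {
            return; // how records without the header are handled is up to the application
        }
        long seq = ByteBuffer.wrap(h.value()).getLong();
        Long current = latestSeq.get(record.key());
        // Keep the record with the larger sequence number, even if its offset is smaller.
        if (current == null || seq >= current) {
            latestSeq.put(record.key(), seq);
            latestValue.put(record.key(), record.value());
        }
    }
}
```

The point is only that the application compares the same attribute the cleaner would compare; with plain offset-based compaction the last two lines would overwrite unconditionally.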

On Tue, Nov 5, 2019 at 11:37 AM Senthilnathan Muthusamy
<se...@microsoft.com.invalid> wrote:

> Hi Guozhang,
>
> Sure and I have made a note in the JIRA item to make sure the wiki is
> updated.
>
> Thanks,
> Senthil
>
> -----Original Message-----
> From: Guozhang Wang <wa...@gmail.com>
> Sent: Monday, November 4, 2019 11:00 AM
> To: dev <de...@kafka.apache.org>
> Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
>
> Hello Senthilnathan,
>
> Thanks for revamping on the KIP. I have only one comment about the wiki
> otherwise LGTM.
>
> 1. We should emphasize that the newly introduced config yields to the
> existing "log.cleanup.policy", i.e. if the latter's value is `delete` not
> `compact`, then the previous config would be ignored.
>
>
> Guozhang
>
> On Mon, Nov 4, 2019 at 9:52 AM Senthilnathan Muthusamy <
> senthilm@microsoft.com.invalid> wrote:
>
> > Hi all,
> >
> > I will start the vote thread shortly for this updated KIP. If there
> > are any more thoughts I would love to hear them.
> >
> > Thanks,
> > Senthil
> >
> > -----Original Message-----
> > From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> > Sent: Thursday, October 31, 2019 3:51 AM
> > To: dev@kafka.apache.org
> > Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> >
> > Hi Matthias
> >
> > Thanks for the response.
> >
> > (1) Yes
> >
> > (2) Yes, and the config name will be the same (i.e.
> > `log.cleaner.compaction.strategy` &
> > `log.cleaner.compaction.strategy.header`) at broker level and topic
> > level (to override broker level default compact strategy). Please let
> > me know if we need to keep it in different naming convention. Note:
> > Broker level (which will be in the server.properties) configuration is
> > optional and default it to offset. Topic level configuration will be
> > default to broker level config...
> >
> > (3) By this new way, it avoids another config parameter and also in
> > feature if any new strategy like header need addition info, no
> > additional config required. As this got discussed already and agreed
> > to have separate config, I will revert it. KIP updated...
> >
> > (4) Done
> >
> > (5) Updated
> >
> > (6) Updated to pick the first header in the list
> >
> > Please let me know if you have any other questions.
> >
> > Thanks,
> > Senthil
> >
> > -----Original Message-----
> > From: Matthias J. Sax <ma...@confluent.io>
> > Sent: Thursday, October 31, 2019 12:13 AM
> > To: dev@kafka.apache.org
> > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> >
> > Thanks for picking up this KIP, Senthil.
> >
> > (1) As far as I remember, the main issue of the original proposal was
> > a missing topic level configuration for the compaction strategy. With
> > this being addressed, I am in favor of this KIP.
> >
> > (2) With regard to (1), it seems we would need a new topic level
> > config `compaction.strategy`, and `log.cleaner.compaction.strategy`
> > would be the default strategy (ie, broker level config) if a topic does
> not overwrite it?
> >
> > (3) Why did you remove `log.cleaner.compaction.strategy.header`
> > parameter and change the accepted values of
> > `log.cleaner.compaction.strategy` to "header.<key>" instead of keeping
> > "header"? The original approach seems to be cleaner, and I think this
> > was discussed on the original discuss thread already.
> >
> > (4) Nit: For the "timestamp" compaction strategy you changed the KIP
> > to
> >
> > -> `The record [create] timestamp`
> >
> > This is miss leading IMHO, because it depends on the broker/log
> > configuration `(log.)message.timestamp.type` that can either be
> > `CreateTime` or `LogAppendTime` what the actual record timestamp is. I
> > would just remove "create" to keep it unspecified.
> >
> > (5) Nit: the section "Public Interfaces" should list the newly
> > introduced configs -- configuration parameters are a public interface.
> >
> > (6) What do you mean by "first level header lookup"? The term "first
> > level" indicates some hierarchy, but headers don't have any hierarchy
> > -- it's just a list of key-value pairs? If you mean the _order_ of the
> > headers, ie, pick the first header in the list that matches the key,
> > please rephrase it to make it clearer.
> >
> >
> >
> > @Tom: I agree with all you are saying, however, I still think that
> > this KIP will improve the overall situation, because everything you
> > pointed out is actually true with offset based compaction, too.
> >
> > The KIP is not a silver bullet that solves all issue for interleaved
> > writes, but I personally believe, it's a good improvement.
> >
> >
> >
> > -Matthias
> >
> >
> > On 10/30/19 9:45 AM, Senthilnathan Muthusamy wrote:
> > > Hi,
> > >
> > > Please let me know if anyone has any questions on this updated
> KIP-280...
> > >
> > > Thanks,
> > >
> > > Senthil
> > >
> > > -----Original Message-----
> > > From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> > > Sent: Monday, October 28, 2019 11:36 PM
> > > To: dev@kafka.apache.org
> > > Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> > >
> > > Hi Tom,
> > >
> > > Sorry for the delayed response.
> > >
> > > Regarding the fall back to offset decision for both timestamp &
> > > header
> > value is based on the previous author discuss
> > https://lists.apache.org/thread.html/f44317eb6cd34f91966654c80509d4a457dbbccdd02b86645782be67@%3Cdev.kafka.apache.org%3E
> > and as per the discussion, it is really required to avoid duplicates.
> > >
> > > And the timestamp strategy is from the original KIP author and we
> > > are
> > keeping it as is.
> > >
> > > Finally on the sequence order guarantee by the producer, it is not
> > feasible on waiting for ack in async / multi-threads/processes
> > scenarios and hence the header sequence based compact strategy with
> > producer's responsibility to have a unique sequence generation for the
> > topic-partition-key level.
> > >
> > > Hoping this clarifies all your questions. Please let us know if you
> > > have
> > any further questions.
> > >
> > > @Guozhang Wang / @Matthias J. Sax, I see you both had a detail
> > discussion on the original KIP with previous author and it would great
> > to hear your inputs as well.
> > >
> > > Thanks,
> > > Senthil
> > >
> > > -----Original Message-----
> > > From: Tom Bentley <tb...@redhat.com>
> > > Sent: Tuesday, October 22, 2019 2:32 AM
> > > To: dev@kafka.apache.org
> > > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> > >
> > > Hi Senthilnathan,
> > >
> > > In the motivation isn't it a little misleading to say "On the
> > > producer side, we clearly preserve an order for the two messages,
> > > <K1, V1> <K1,
> > > V2>"? IMHO, the semantics of the producer are clear that having an
> > > V2>observed
> > > order of sending records from different producers is not sufficient
> > > to
> > guarantee ordering on the broker. You really need to send the 2nd
> > record only after the 1st record is acked. It's the difficultly of
> > achieving that in practice that's the true motivation for your KIP.
> > >
> > > I can see the attraction of using timestamps, but it would be
> > > helpful to
> > explain how that really solves the problem. When the producers are in
> > different processes on different machines you're relying on their
> > clocks being synchronized, which is a whole subject in itself. Even if
> > they're synchronized the resolution of System.currentTimeMillis() is
> > typically many milliseconds. If your producers are in different
> > threads of the same process that could be a real problem because it
> makes ties quite likely.
> > > And you don't explain why it's OK to resolve ties using the offset.
> > > The
> > basis of your argument is that the offset is giving you the wrong answer.
> > > So it seems to me that using it as a tiebreaker is just narrowing
> > > the
> > chances of getting the wrong answer. Maybe none of this matters for
> > your use case, but I think it should be spelled out in the KIP,
> > because it surely would matter for similar use cases.
> > >
> > > Using a sequence at least removes the problem of ties, but the
> > interesting bit is now in how you deal with races between
> > threads/processes in getting a sequence number allocated (which is out
> > of scope of the KIP, I guess).
> > > How is resolving that race any simpler that resolving the motivating
> > race by waiting for the ack of the first record sent?
> > >
> > > Kind regards,
> > >
> > > Tom
> > >
> > > On Mon, Oct 21, 2019 at 9:06 PM Senthilnathan Muthusamy <
> > senthilm@microsoft.com.invalid> wrote:
> > >
> > >> Hi All,
> > >>
> > >> We are bring back the KIP-280 to live with small correct for the
> > >> discussion & voting. Thanks to previous author Luis Cabral on the
> > >> KIP-280 initiation and we are taking over to complete and get it
> > >> into
> > 2.4...
> > >>
> > >> Below is the correction that we made to the existing KIP-280:
> > >>
> > >>   *   Allowing the compact strategy configuration at the topic level
> as
> > >> the log compaction is at the topic level and a broker can have
> > >> multiple topics. This allows the flexibility to have the strategy
> > >> at both broker level (i.e. for all topics within the broker) and
> > >> topic level (i.e. for a subset of topics within a broker) as well...
> > >>
> > >> KIP-280:
> > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction
> > >> PULL REQUEST: https://github.com/apache/kafka/pull/7528 (unit test coverage in
> > >> progress)
> > >>
> > >> Previous Thread DISCUSS:
> > >> https://lists.apache.org/thread.html/79aa6e50d7c737ddf83455dd8063692a535a1afa558620fe1a1496d3@%3Cdev.kafka.apache.org%3E
> > >> Previous Thread VOTE:
> > >> https://lists.apache.org/thread.html/b2ecd73ce849741f0c40b4f801c3f7650583497812713e240e1ac2b7@%3Cdev.kafka.apache.org%3E
> > >>
> > >> Appreciate your timely action.
> > >>
> > >> PS: Initiating a separate thread as I was not able to reply to the
> > >> existing threads...
> > >>
> > >> Thanks,
> > >> Senthil
> > >>
> >
> >
>
> --
> -- Guozhang
>

RE: [DISCUSS] KIP-280: Enhanced log compaction

Posted by Senthilnathan Muthusamy <se...@microsoft.com.INVALID>.
Hi Guozhang,

Sure and I have made a note in the JIRA item to make sure the wiki is updated.

Thanks,
Senthil

-----Original Message-----
From: Guozhang Wang <wa...@gmail.com> 
Sent: Monday, November 4, 2019 11:00 AM
To: dev <de...@kafka.apache.org>
Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction

Hello Senthilnathan,

Thanks for revamping on the KIP. I have only one comment about the wiki otherwise LGTM.

1. We should emphasize that the newly introduced config yields to the existing "log.cleanup.policy", i.e. if the latter's value is `delete` not `compact`, then the previous config would be ignored.


Guozhang

On Mon, Nov 4, 2019 at 9:52 AM Senthilnathan Muthusamy <se...@microsoft.com.invalid> wrote:

> Hi all,
>
> I will start the vote thread shortly for this updated KIP. If there 
> are any more thoughts I would love to hear them.
>
> Thanks,
> Senthil
>
> -----Original Message-----
> From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> Sent: Thursday, October 31, 2019 3:51 AM
> To: dev@kafka.apache.org
> Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
>
> Hi Matthias
>
> Thanks for the response.
>
> (1) Yes
>
> (2) Yes, and the config name will be the same (i.e.
> `log.cleaner.compaction.strategy` &
> `log.cleaner.compaction.strategy.header`) at broker level and topic 
> level (to override broker level default compact strategy). Please let 
> me know if we need to keep it in different naming convention. Note: 
> Broker level (which will be in the server.properties) configuration is 
> optional and default it to offset. Topic level configuration will be 
> default to broker level config...
>
> (3) By this new way, it avoids another config parameter and also in 
> feature if any new strategy like header need addition info, no 
> additional config required. As this got discussed already and agreed 
> to have separate config, I will revert it. KIP updated...
>
> (4) Done
>
> (5) Updated
>
> (6) Updated to pick the first header in the list
>
> Please let me know if you have any other questions.
>
> Thanks,
> Senthil
>
> -----Original Message-----
> From: Matthias J. Sax <ma...@confluent.io>
> Sent: Thursday, October 31, 2019 12:13 AM
> To: dev@kafka.apache.org
> Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
>
> Thanks for picking up this KIP, Senthil.
>
> (1) As far as I remember, the main issue of the original proposal was 
> a missing topic level configuration for the compaction strategy. With 
> this being addressed, I am in favor of this KIP.
>
> (2) With regard to (1), it seems we would need a new topic level 
> config `compaction.strategy`, and `log.cleaner.compaction.strategy` 
> would be the default strategy (ie, broker level config) if a topic does not overwrite it?
>
> (3) Why did you remove `log.cleaner.compaction.strategy.header`
> parameter and change the accepted values of 
> `log.cleaner.compaction.strategy` to "header.<key>" instead of keeping 
> "header"? The original approach seems to be cleaner, and I think this 
> was discussed on the original discuss thread already.
>
> (4) Nit: For the "timestamp" compaction strategy you changed the KIP 
> to
>
> -> `The record [create] timestamp`
>
> This is miss leading IMHO, because it depends on the broker/log 
> configuration `(log.)message.timestamp.type` that can either be 
> `CreateTime` or `LogAppendTime` what the actual record timestamp is. I 
> would just remove "create" to keep it unspecified.
>
> (5) Nit: the section "Public Interfaces" should list the newly 
> introduced configs -- configuration parameters are a public interface.
>
> (6) What do you mean by "first level header lookup"? The term "first 
> level" indicates some hierarchy, but headers don't have any hierarchy 
> -- it's just a list of key-value pairs? If you mean the _order_ of the 
> headers, ie, pick the first header in the list that matches the key, 
> please rephrase it to make it clearer.
>
>
>
> @Tom: I agree with all you are saying, however, I still think that 
> this KIP will improve the overall situation, because everything you 
> pointed out is actually true with offset based compaction, too.
>
> The KIP is not a silver bullet that solves all issue for interleaved 
> writes, but I personally believe, it's a good improvement.
>
>
>
> -Matthias
>
>
> On 10/30/19 9:45 AM, Senthilnathan Muthusamy wrote:
> > Hi,
> >
> > Please let me know if anyone has any questions on this updated KIP-280...
> >
> > Thanks,
> >
> > Senthil
> >
> > -----Original Message-----
> > From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> > Sent: Monday, October 28, 2019 11:36 PM
> > To: dev@kafka.apache.org
> > Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> >
> > Hi Tom,
> >
> > Sorry for the delayed response.
> >
> > Regarding the fall back to offset decision for both timestamp & 
> > header
> value is based on the previous author discuss
> https://lists.apache.org/thread.html/f44317eb6cd34f91966654c80509d4a457dbbccdd02b86645782be67@%3Cdev.kafka.apache.org%3E
> and as per the discussion, it is really required to avoid duplicates.
> >
> > And the timestamp strategy is from the original KIP author and we 
> > are
> keeping it as is.
> >
> > Finally on the sequence order guarantee by the producer, it is not
> feasible on waiting for ack in async / multi-threads/processes 
> scenarios and hence the header sequence based compact strategy with 
> producer's responsibility to have a unique sequence generation for the 
> topic-partition-key level.
> >
> > Hoping this clarifies all your questions. Please let us know if you 
> > have
> any further questions.
> >
> > @Guozhang Wang / @Matthias J. Sax, I see you both had a detail
> discussion on the original KIP with previous author and it would great 
> to hear your inputs as well.
> >
> > Thanks,
> > Senthil
> >
> > -----Original Message-----
> > From: Tom Bentley <tb...@redhat.com>
> > Sent: Tuesday, October 22, 2019 2:32 AM
> > To: dev@kafka.apache.org
> > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> >
> > Hi Senthilnathan,
> >
> > In the motivation isn't it a little misleading to say "On the 
> > producer side, we clearly preserve an order for the two messages, 
> > <K1, V1> <K1,
> > V2>"? IMHO, the semantics of the producer are clear that having an 
> > V2>observed
> > order of sending records from different producers is not sufficient 
> > to
> guarantee ordering on the broker. You really need to send the 2nd 
> record only after the 1st record is acked. It's the difficultly of 
> achieving that in practice that's the true motivation for your KIP.
> >
> > I can see the attraction of using timestamps, but it would be 
> > helpful to
> explain how that really solves the problem. When the producers are in 
> different processes on different machines you're relying on their 
> clocks being synchronized, which is a whole subject in itself. Even if 
> they're synchronized the resolution of System.currentTimeMillis() is 
> typically many milliseconds. If your producers are in different 
> threads of the same process that could be a real problem because it makes ties quite likely.
> > And you don't explain why it's OK to resolve ties using the offset. 
> > The
> basis of your argument is that the offset is giving you the wrong answer.
> > So it seems to me that using it as a tiebreaker is just narrowing 
> > the
> chances of getting the wrong answer. Maybe none of this matters for 
> your use case, but I think it should be spelled out in the KIP, 
> because it surely would matter for similar use cases.
> >
> > Using a sequence at least removes the problem of ties, but the
> interesting bit is now in how you deal with races between 
> threads/processes in getting a sequence number allocated (which is out 
> of scope of the KIP, I guess).
> > How is resolving that race any simpler that resolving the motivating
> race by waiting for the ack of the first record sent?
> >
> > Kind regards,
> >
> > Tom
> >
> > On Mon, Oct 21, 2019 at 9:06 PM Senthilnathan Muthusamy <
> senthilm@microsoft.com.invalid> wrote:
> >
> >> Hi All,
> >>
> >> We are bring back the KIP-280 to live with small correct for the 
> >> discussion & voting. Thanks to previous author Luis Cabral on the
> >> KIP-280 initiation and we are taking over to complete and get it 
> >> into
> 2.4...
> >>
> >> Below is the correction that we made to the existing KIP-280:
> >>
> >>   *   Allowing the compact strategy configuration at the topic level as
> >> the log compaction is at the topic level and a broker can have 
> >> multiple topics. This allows the flexibility to have the strategy 
> >> at both broker level (i.e. for all topics within the broker) and 
> >> topic level (i.e. for a subset of topics within a broker) as well...
> >>
> >> KIP-280:
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction
> >> PULL REQUEST: https://github.com/apache/kafka/pull/7528 (unit test coverage in
> >> progress)
> >>
> >> Previous Thread DISCUSS:
> >> https://lists.apache.org/thread.html/79aa6e50d7c737ddf83455dd8063692a535a1afa558620fe1a1496d3@%3Cdev.kafka.apache.org%3E
> >> Previous Thread VOTE:
> >> https://lists.apache.org/thread.html/b2ecd73ce849741f0c40b4f801c3f7650583497812713e240e1ac2b7@%3Cdev.kafka.apache.org%3E
> >>
> >> Appreciate your timely action.
> >>
> >> PS: Initiating a separate thread as I was not able to reply to the 
> >> existing threads...
> >>
> >> Thanks,
> >> Senthil
> >>
>
>

--
-- Guozhang

Re: [DISCUSS] KIP-280: Enhanced log compaction

Posted by Guozhang Wang <wa...@gmail.com>.
Hello Senthilnathan,

Thanks for revamping on the KIP. I have only one comment about the wiki
otherwise LGTM.

1. We should emphasize that the newly introduced config yields to the
existing "log.cleanup.policy", i.e. if the latter's value is `delete` not
`compact`, then the previous config would be ignored.


Guozhang
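
To make the relationship concrete, here is a sketch of how the configs discussed in this thread could be set on a single topic through the admin client. Only `cleanup.policy` exists in released Kafka; `compaction.strategy` and `compaction.strategy.header` are the names proposed in this thread, and the topic name and bootstrap address are placeholders. Per the comment above, the strategy settings would be ignored unless the policy is `compact`:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public final class CompactionStrategyConfigSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
            admin.incrementalAlterConfigs(Collections.singletonMap(topic, Arrays.asList(
                // Existing config: the new strategy only applies when the policy is "compact".
                new AlterConfigOp(new ConfigEntry("cleanup.policy", "compact"), AlterConfigOp.OpType.SET),
                // Proposed by KIP-280 (topic level override of the broker level default):
                new AlterConfigOp(new ConfigEntry("compaction.strategy", "header"), AlterConfigOp.OpType.SET),
                new AlterConfigOp(new ConfigEntry("compaction.strategy.header", "sequence"), AlterConfigOp.OpType.SET)
            ))).all().get();
        }
    }
}
```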

On Mon, Nov 4, 2019 at 9:52 AM Senthilnathan Muthusamy
<se...@microsoft.com.invalid> wrote:

> Hi all,
>
> I will start the vote thread shortly for this updated KIP. If there are
> any more thoughts I would love to hear them.
>
> Thanks,
> Senthil
>
> -----Original Message-----
> From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> Sent: Thursday, October 31, 2019 3:51 AM
> To: dev@kafka.apache.org
> Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
>
> Hi Matthias
>
> Thanks for the response.
>
> (1) Yes
>
> (2) Yes, and the config name will be the same (i.e.
> `log.cleaner.compaction.strategy` &
> `log.cleaner.compaction.strategy.header`) at broker level and topic level
> (to override broker level default compact strategy). Please let me know if
> we need to keep it in different naming convention. Note: Broker level
> (which will be in the server.properties) configuration is optional and
> default it to offset. Topic level configuration will be default to broker
> level config...
>
> (3) By this new way, it avoids another config parameter and also in
> feature if any new strategy like header need addition info, no additional
> config required. As this got discussed already and agreed to have separate
> config, I will revert it. KIP updated...
>
> (4) Done
>
> (5) Updated
>
> (6) Updated to pick the first header in the list
>
> Please let me know if you have any other questions.
>
> Thanks,
> Senthil
>
> -----Original Message-----
> From: Matthias J. Sax <ma...@confluent.io>
> Sent: Thursday, October 31, 2019 12:13 AM
> To: dev@kafka.apache.org
> Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
>
> Thanks for picking up this KIP, Senthil.
>
> (1) As far as I remember, the main issue of the original proposal was a
> missing topic level configuration for the compaction strategy. With this
> being addressed, I am in favor of this KIP.
>
> (2) With regard to (1), it seems we would need a new topic level config
> `compaction.strategy`, and `log.cleaner.compaction.strategy` would be the
> default strategy (ie, broker level config) if a topic does not overwrite it?
>
> (3) Why did you remove `log.cleaner.compaction.strategy.header`
> parameter and change the accepted values of
> `log.cleaner.compaction.strategy` to "header.<key>" instead of keeping
> "header"? The original approach seems to be cleaner, and I think this was
> discussed on the original discuss thread already.
>
> (4) Nit: For the "timestamp" compaction strategy you changed the KIP to
>
> -> `The record [create] timestamp`
>
> This is miss leading IMHO, because it depends on the broker/log
> configuration `(log.)message.timestamp.type` that can either be
> `CreateTime` or `LogAppendTime` what the actual record timestamp is. I
> would just remove "create" to keep it unspecified.
>
> (5) Nit: the section "Public Interfaces" should list the newly introduced
> configs -- configuration parameters are a public interface.
>
> (6) What do you mean by "first level header lookup"? The term "first
> level" indicates some hierarchy, but headers don't have any hierarchy --
> it's just a list of key-value pairs? If you mean the _order_ of the
> headers, ie, pick the first header in the list that matches the key, please
> rephrase it to make it clearer.
>
>
>
> @Tom: I agree with all you are saying, however, I still think that this
> KIP will improve the overall situation, because everything you pointed out
> is actually true with offset based compaction, too.
>
> The KIP is not a silver bullet that solves all issue for interleaved
> writes, but I personally believe, it's a good improvement.
>
>
>
> -Matthias
>
>
> On 10/30/19 9:45 AM, Senthilnathan Muthusamy wrote:
> > Hi,
> >
> > Please let me know if anyone has any questions on this updated KIP-280...
> >
> > Thanks,
> >
> > Senthil
> >
> > -----Original Message-----
> > From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> > Sent: Monday, October 28, 2019 11:36 PM
> > To: dev@kafka.apache.org
> > Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> >
> > Hi Tom,
> >
> > Sorry for the delayed response.
> >
> > Regarding the fall back to offset decision for both timestamp & header
> value is based on the previous author discuss
> https://lists.apache.org/thread.html/f44317eb6cd34f91966654c80509d4a457dbbccdd02b86645782be67@%3Cdev.kafka.apache.org%3E
> and as per the discussion, it is really required to avoid duplicates.
> >
> > And the timestamp strategy is from the original KIP author and we are
> keeping it as is.
> >
> > Finally on the sequence order guarantee by the producer, it is not
> feasible on waiting for ack in async / multi-threads/processes scenarios
> and hence the header sequence based compact strategy with producer's
> responsibility to have a unique sequence generation for the
> topic-partition-key level.
> >
> > Hoping this clarifies all your questions. Please let us know if you have
> any further questions.
> >
> > @Guozhang Wang / @Matthias J. Sax, I see you both had a detail
> discussion on the original KIP with previous author and it would great to
> hear your inputs as well.
> >
> > Thanks,
> > Senthil
> >
> > -----Original Message-----
> > From: Tom Bentley <tb...@redhat.com>
> > Sent: Tuesday, October 22, 2019 2:32 AM
> > To: dev@kafka.apache.org
> > Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> >
> > Hi Senthilnathan,
> >
> > In the motivation isn't it a little misleading to say "On the producer
> > side, we clearly preserve an order for the two messages, <K1, V1> <K1,
> > V2>"? IMHO, the semantics of the producer are clear that having an
> > V2>observed
> > order of sending records from different producers is not sufficient to
> guarantee ordering on the broker. You really need to send the 2nd record
> only after the 1st record is acked. It's the difficultly of achieving that
> in practice that's the true motivation for your KIP.
> >
> > I can see the attraction of using timestamps, but it would be helpful to
> explain how that really solves the problem. When the producers are in
> different processes on different machines you're relying on their clocks
> being synchronized, which is a whole subject in itself. Even if they're
> synchronized the resolution of System.currentTimeMillis() is typically many
> milliseconds. If your producers are in different threads of the same
> process that could be a real problem because it makes ties quite likely.
> > And you don't explain why it's OK to resolve ties using the offset. The
> basis of your argument is that the offset is giving you the wrong answer.
> > So it seems to me that using it as a tiebreaker is just narrowing the
> chances of getting the wrong answer. Maybe none of this matters for your
> use case, but I think it should be spelled out in the KIP, because it
> surely would matter for similar use cases.
> >
> > Using a sequence at least removes the problem of ties, but the
> interesting bit is now in how you deal with races between threads/processes
> in getting a sequence number allocated (which is out of scope of the KIP, I
> guess).
> > How is resolving that race any simpler that resolving the motivating
> race by waiting for the ack of the first record sent?
> >
> > Kind regards,
> >
> > Tom
> >
> > On Mon, Oct 21, 2019 at 9:06 PM Senthilnathan Muthusamy <
> senthilm@microsoft.com.invalid> wrote:
> >
> >> Hi All,
> >>
> >> We are bring back the KIP-280 to live with small correct for the
> >> discussion & voting. Thanks to previous author Luis Cabral on the
> >> KIP-280 initiation and we are taking over to complete and get it into
> 2.4...
> >>
> >> Below is the correction that we made to the existing KIP-280:
> >>
> >>   *   Allowing the compact strategy configuration at the topic level as
> >> the log compaction is at the topic level and a broker can have
> >> multiple topics. This allows the flexibility to have the strategy at
> >> both broker level (i.e. for all topics within the broker) and topic
> >> level (i.e. for a subset of topics within a broker) as well...
> >>
> >> KIP-280:
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction
> >> PULL REQUEST: https://github.com/apache/kafka/pull/7528 (unit test coverage in
> >> progress)
> >>
> >> Previous Thread DISCUSS:
> >> https://lists.apache.org/thread.html/79aa6e50d7c737ddf83455dd8063692a535a1afa558620fe1a1496d3@%3Cdev.kafka.apache.org%3E
> >> Previous Thread VOTE:
> >> https://lists.apache.org/thread.html/b2ecd73ce849741f0c40b4f801c3f7650583497812713e240e1ac2b7@%3Cdev.kafka.apache.org%3E
> >>
> >> Appreciate your timely action.
> >>
> >> PS: Initiating a separate thread as I was not able to reply to the
> >> existing threads...
> >>
> >> Thanks,
> >> Senthil
> >>
>
>

-- 
-- Guozhang

RE: [DISCUSS] KIP-280: Enhanced log compaction

Posted by Senthilnathan Muthusamy <se...@microsoft.com.INVALID>.
Hi all,

I will start the vote thread shortly for this updated KIP. If there are any more thoughts I would love to hear them.

Thanks,
Senthil

-----Original Message-----
From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID> 
Sent: Thursday, October 31, 2019 3:51 AM
To: dev@kafka.apache.org
Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction

Hi Matthias

Thanks for the response.

(1) Yes

(2) Yes, and the config name will be the same (i.e. `log.cleaner.compaction.strategy` & `log.cleaner.compaction.strategy.header`) at broker level and topic level (to override broker level default compact strategy). Please let me know if we need to keep it in different naming convention. Note: Broker level (which will be in the server.properties) configuration is optional and default it to offset. Topic level configuration will be default to broker level config...

(3) By this new way, it avoids another config parameter and also in feature if any new strategy like header need addition info, no additional config required. As this got discussed already and agreed to have separate config, I will revert it. KIP updated...

(4) Done

(5) Updated

(6) Updated to pick the first header in the list

Please let me know if you have any other questions.

Thanks,
Senthil

-----Original Message-----
From: Matthias J. Sax <ma...@confluent.io>
Sent: Thursday, October 31, 2019 12:13 AM
To: dev@kafka.apache.org
Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction

Thanks for picking up this KIP, Senthil.

(1) As far as I remember, the main issue of the original proposal was a missing topic level configuration for the compaction strategy. With this being addressed, I am in favor of this KIP.

(2) With regard to (1), it seems we would need a new topic level config `compaction.strategy`, and `log.cleaner.compaction.strategy` would be the default strategy (ie, broker level config) if a topic does not overwrite it?

(3) Why did you remove `log.cleaner.compaction.strategy.header`
parameter and change the accepted values of `log.cleaner.compaction.strategy` to "header.<key>" instead of keeping "header"? The original approach seems to be cleaner, and I think this was discussed on the original discuss thread already.

(4) Nit: For the "timestamp" compaction strategy you changed the KIP to

-> `The record [create] timestamp`

This is miss leading IMHO, because it depends on the broker/log configuration `(log.)message.timestamp.type` that can either be `CreateTime` or `LogAppendTime` what the actual record timestamp is. I would just remove "create" to keep it unspecified.

(5) Nit: the section "Public Interfaces" should list the newly introduced configs -- configuration parameters are a public interface.

(6) What do you mean by "first level header lookup"? The term "first level" indicates some hierarchy, but headers don't have any hierarchy -- it's just a list of key-value pairs? If you mean the _order_ of the headers, ie, pick the first header in the list that matches the key, please rephrase it to make it clearer.



@Tom: I agree with all you are saying, however, I still think that this KIP will improve the overall situation, because everything you pointed out is actually true with offset based compaction, too.

The KIP is not a silver bullet that solves all issue for interleaved writes, but I personally believe, it's a good improvement.



-Matthias


On 10/30/19 9:45 AM, Senthilnathan Muthusamy wrote:
> Hi,
> 
> Please let me know if anyone has any questions on this updated KIP-280...
> 
> Thanks,
> 
> Senthil
> 
> -----Original Message-----
> From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> Sent: Monday, October 28, 2019 11:36 PM
> To: dev@kafka.apache.org
> Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> 
> Hi Tom,
> 
> Sorry for the delayed response.
> 
> Regarding the fall back to offset decision for both timestamp & header value is based on the previous author discuss https://lists.apache.org/thread.html/f44317eb6cd34f91966654c80509d4a457dbbccdd02b86645782be67@%3Cdev.kafka.apache.org%3E and as per the discussion, it is really required to avoid duplicates.
> 
> And the timestamp strategy is from the original KIP author and we are keeping it as is.
> 
> Finally on the sequence order guarantee by the producer, it is not feasible on waiting for ack in async / multi-threads/processes scenarios and hence the header sequence based compact strategy with producer's responsibility to have a unique sequence generation for the topic-partition-key level.
> 
> Hoping this clarifies all your questions. Please let us know if you have any further questions.
> 
> @Guozhang Wang / @Matthias J. Sax, I see you both had a detail discussion on the original KIP with previous author and it would great to hear your inputs as well.
> 
> Thanks,
> Senthil
> 
> -----Original Message-----
> From: Tom Bentley <tb...@redhat.com>
> Sent: Tuesday, October 22, 2019 2:32 AM
> To: dev@kafka.apache.org
> Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> 
> Hi Senthilnathan,
> 
> In the motivation isn't it a little misleading to say "On the producer 
> side, we clearly preserve an order for the two messages, <K1, V1> <K1,
> V2>"? IMHO, the semantics of the producer are clear that having an 
> V2>observed
> order of sending records from different producers is not sufficient to guarantee ordering on the broker. You really need to send the 2nd record only after the 1st record is acked. It's the difficultly of achieving that in practice that's the true motivation for your KIP.
> 
> I can see the attraction of using timestamps, but it would be helpful to explain how that really solves the problem. When the producers are in different processes on different machines you're relying on their clocks being synchronized, which is a whole subject in itself. Even if they're synchronized the resolution of System.currentTimeMillis() is typically many milliseconds. If your producers are in different threads of the same process that could be a real problem because it makes ties quite likely.
> And you don't explain why it's OK to resolve ties using the offset. The basis of your argument is that the offset is giving you the wrong answer.
> So it seems to me that using it as a tiebreaker is just narrowing the chances of getting the wrong answer. Maybe none of this matters for your use case, but I think it should be spelled out in the KIP, because it surely would matter for similar use cases.
> 
> Using a sequence at least removes the problem of ties, but the interesting bit is now in how you deal with races between threads/processes in getting a sequence number allocated (which is out of scope of the KIP, I guess).
> How is resolving that race any simpler that resolving the motivating race by waiting for the ack of the first record sent?
> 
> Kind regards,
> 
> Tom
> 
> On Mon, Oct 21, 2019 at 9:06 PM Senthilnathan Muthusamy <se...@microsoft.com.invalid> wrote:
> 
>> Hi All,
>>
>> We are bring back the KIP-280 to live with small correct for the 
>> discussion & voting. Thanks to previous author Luis Cabral on the
>> KIP-280 initiation and we are taking over to complete and get it into 2.4...
>>
>> Below is the correction that we made to the existing KIP-280:
>>
>>   *   Allowing the compact strategy configuration at the topic level as
>> the log compaction is at the topic level and a broker can have 
>> multiple topics. This allows the flexibility to have the strategy at 
>> both broker level (i.e. for all topics within the broker) and topic 
>> level (i.e. for a subset of topics within a broker) as well...
>>
>> KIP-280:
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction
>> PULL REQUEST: https://github.com/apache/kafka/pull/7528 (unit test coverage in
>> progress)
>>
>> Previous Thread DISCUSS:
>> https://lists.apache.org/thread.html/79aa6e50d7c737ddf83455dd8063692a535a1afa558620fe1a1496d3@%3Cdev.kafka.apache.org%3E
>> Previous Thread VOTE:
>> https://lists.apache.org/thread.html/b2ecd73ce849741f0c40b4f801c3f7650583497812713e240e1ac2b7@%3Cdev.kafka.apache.org%3E
>>
>> Appreciate your timely action.
>>
>> PS: Initiating a separate thread as I was not able to reply to the 
>> existing threads...
>>
>> Thanks,
>> Senthil
>>


RE: [DISCUSS] KIP-280: Enhanced log compaction

Posted by Senthilnathan Muthusamy <se...@microsoft.com.INVALID>.
Hi Matthias

Thanks for the response.

(1) Yes

(2) Yes, and the config names will be the same (i.e. `log.cleaner.compaction.strategy` & `log.cleaner.compaction.strategy.header`) at the broker level and the topic level (to override the broker level default compact strategy). Please let me know if we need to use a different naming convention. Note: the broker level configuration (which will be in server.properties) is optional and defaults to offset. The topic level configuration will default to the broker level config...

(3) The new way avoids another config parameter, and in the future, if any new strategy like header needs additional info, no additional config would be required. As this was already discussed and a separate config was agreed upon, I will revert it. KIP updated...

(4) Done

(5) Updated

(6) Updated to pick the first header in the list

Please let me know if you have any other questions.

Thanks,
Senthil
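
For illustration only, below is a rough sketch of the comparison semantics described in (2) and (6): scan the record's headers in order, take the first one whose key matches the configured header key, and fall back to the offset when the header is missing or the values tie. This is not the cleaner code from the pull request, and the 8-byte long encoding of the header value is an assumption:

```java
import java.nio.ByteBuffer;
import org.apache.kafka.common.header.Header;
import org.apache.kafka.common.header.Headers;

final class HeaderStrategySketch {

    // Return the compaction "version" carried by the first header matching the
    // configured key, or null when the record carries no usable header.
    static Long headerValue(Headers headers, String configuredKey) {
        for (Header h : headers) { // headers keep their insertion order
            if (h.key().equals(configuredKey) && h.value() != null && h.value().length == 8) {
                return ByteBuffer.wrap(h.value()).getLong();
            }
        }
        return null;
    }

    // Decide whether a candidate record should replace the record currently retained
    // for the same key; missing headers and ties fall back to the offset, as discussed.
    static boolean shouldReplace(Long retained, long retainedOffset,
                                 Long candidate, long candidateOffset) {
        if (retained == null || candidate == null || candidate.longValue() == retained.longValue()) {
            return candidateOffset > retainedOffset;
        }
        return candidate > retained;
    }
}
```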

-----Original Message-----
From: Matthias J. Sax <ma...@confluent.io>
Sent: Thursday, October 31, 2019 12:13 AM
To: dev@kafka.apache.org
Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction

Thanks for picking up this KIP, Senthil.

(1) As far as I remember, the main issue of the original proposal was a missing topic level configuration for the compaction strategy. With this being addressed, I am in favor of this KIP.

(2) With regard to (1), it seems we would need a new topic level config `compaction.strategy`, and `log.cleaner.compaction.strategy` would be the default strategy (ie, broker level config) if a topic does not overwrite it?

(3) Why did you remove `log.cleaner.compaction.strategy.header`
parameter and change the accepted values of `log.cleaner.compaction.strategy` to "header.<key>" instead of keeping "header"? The original approach seems to be cleaner, and I think this was discussed on the original discuss thread already.

(4) Nit: For the "timestamp" compaction strategy you changed the KIP to

-> `The record [create] timestamp`

This is miss leading IMHO, because it depends on the broker/log configuration `(log.)message.timestamp.type` that can either be `CreateTime` or `LogAppendTime` what the actual record timestamp is. I would just remove "create" to keep it unspecified.

(5) Nit: the section "Public Interfaces" should list the newly introduced configs -- configuration parameters are a public interface.

(6) What do you mean by "first level header lookup"? The term "first level" indicates some hierarchy, but headers don't have any hierarchy -- it's just a list of key-value pairs? If you mean the _order_ of the headers, ie, pick the first header in the list that matches the key, please rephrase it to make it clearer.



@Tom: I agree with all you are saying, however, I still think that this KIP will improve the overall situation, because everything you pointed out is actually true with offset based compaction, too.

The KIP is not a silver bullet that solves all issue for interleaved writes, but I personally believe, it's a good improvement.



-Matthias


On 10/30/19 9:45 AM, Senthilnathan Muthusamy wrote:
> Hi,
> 
> Please let me know if anyone has any questions on this updated KIP-280...
> 
> Thanks,
> 
> Senthil
> 
> -----Original Message-----
> From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID>
> Sent: Monday, October 28, 2019 11:36 PM
> To: dev@kafka.apache.org
> Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> 
> Hi Tom,
> 
> Sorry for the delayed response.
> 
> Regarding the fall back to offset decision for both timestamp & header value is based on the previous author discuss https://lists.apache.org/thread.html/f44317eb6cd34f91966654c80509d4a457dbbccdd02b86645782be67@%3Cdev.kafka.apache.org%3E and as per the discussion, it is really required to avoid duplicates.
> 
> And the timestamp strategy is from the original KIP author and we are keeping it as is.
> 
> Finally on the sequence order guarantee by the producer, it is not feasible on waiting for ack in async / multi-threads/processes scenarios and hence the header sequence based compact strategy with producer's responsibility to have a unique sequence generation for the topic-partition-key level.
> 
> Hoping this clarifies all your questions. Please let us know if you have any further questions.
> 
> @Guozhang Wang / @Matthias J. Sax, I see you both had a detail discussion on the original KIP with previous author and it would great to hear your inputs as well.
> 
> Thanks,
> Senthil
> 
> -----Original Message-----
> From: Tom Bentley <tb...@redhat.com>
> Sent: Tuesday, October 22, 2019 2:32 AM
> To: dev@kafka.apache.org
> Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> 
> Hi Senthilnathan,
> 
> In the motivation isn't it a little misleading to say "On the producer 
> side, we clearly preserve an order for the two messages, <K1, V1> <K1,
> V2>"? IMHO, the semantics of the producer are clear that having an 
> V2>observed
> order of sending records from different producers is not sufficient to guarantee ordering on the broker. You really need to send the 2nd record only after the 1st record is acked. It's the difficultly of achieving that in practice that's the true motivation for your KIP.
> 
> I can see the attraction of using timestamps, but it would be helpful to explain how that really solves the problem. When the producers are in different processes on different machines you're relying on their clocks being synchronized, which is a whole subject in itself. Even if they're synchronized the resolution of System.currentTimeMillis() is typically many milliseconds. If your producers are in different threads of the same process that could be a real problem because it makes ties quite likely.
> And you don't explain why it's OK to resolve ties using the offset. The basis of your argument is that the offset is giving you the wrong answer.
> So it seems to me that using it as a tiebreaker is just narrowing the chances of getting the wrong answer. Maybe none of this matters for your use case, but I think it should be spelled out in the KIP, because it surely would matter for similar use cases.
> 
> Using a sequence at least removes the problem of ties, but the interesting bit is now in how you deal with races between threads/processes in getting a sequence number allocated (which is out of scope of the KIP, I guess).
> How is resolving that race any simpler that resolving the motivating race by waiting for the ack of the first record sent?
> 
> Kind regards,
> 
> Tom
> 
> On Mon, Oct 21, 2019 at 9:06 PM Senthilnathan Muthusamy <se...@microsoft.com.invalid> wrote:
> 
>> Hi All,
>>
>> We are bring back the KIP-280 to live with small correct for the 
>> discussion & voting. Thanks to previous author Luis Cabral on the
>> KIP-280 initiation and we are taking over to complete and get it into 2.4...
>>
>> Below is the correction that we made to the existing KIP-280:
>>
>>   *   Allowing the compact strategy configuration at the topic level as
>> the log compaction is at the topic level and a broker can have 
>> multiple topics. This allows the flexibility to have the strategy at 
>> both broker level (i.e. for all topics within the broker) and topic 
>> level (i.e. for a subset of topics within a broker) as well...
>>
>> KIP-280:
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction
>> PULL REQUEST: https://github.com/apache/kafka/pull/7528 (unit test coverage in
>> progress)
>>
>> Previous Thread DISCUSS:
>> https://lists.apache.org/thread.html/79aa6e50d7c737ddf83455dd8063692a535a1afa558620fe1a1496d3@%3Cdev.kafka.apache.org%3E
>> Previous Thread VOTE:
>> https://lists.apache.org/thread.html/b2ecd73ce849741f0c40b4f801c3f7650583497812713e240e1ac2b7@%3Cdev.kafka.apache.org%3E
>>
>> Appreciate your timely action.
>>
>> PS: Initiating a separate thread as I was not able to reply to the 
>> existing threads...
>>
>> Thanks,
>> Senthil
>>


Re: [DISCUSS] KIP-280: Enhanced log compaction

Posted by "Matthias J. Sax" <ma...@confluent.io>.
Thanks for picking up this KIP, Senthil.

(1) As far as I remember, the main issue of the original proposal was a
missing topic level configuration for the compaction strategy. With this
being addressed, I am in favor of this KIP.

(2) With regard to (1), it seems we would need a new topic level config
`compaction.strategy`, and `log.cleaner.compaction.strategy` would be
the default strategy (ie, broker level config) if a topic does not
overwrite it?
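
(For illustration, a minimal sketch of how such a per-topic override could
be applied with the standard AdminClient, assuming the KIP's proposed
topic config is named `compaction.strategy`; the name and accepted values
are still under discussion and are not part of any released Kafka:)

    // Sketch only: "compaction.strategy" is the topic-level config proposed
    // in KIP-280; it does not exist in released Kafka yet.
    import java.util.Collection;
    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class SetTopicCompactionStrategy {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                ConfigResource topic =
                    new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
                AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("compaction.strategy", "header"), // value per the KIP proposal
                    AlterConfigOp.OpType.SET);
                Map<ConfigResource, Collection<AlterConfigOp>> configs =
                    Collections.singletonMap(topic, Collections.singletonList(op));
                admin.incrementalAlterConfigs(configs).all().get();
            }
        }
    }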

(3) Why did you remove `log.cleaner.compaction.strategy.header`
parameter and change the accepted values of
`log.cleaner.compaction.strategy` to "header.<key>" instead of keeping
"header"? The original approach seems to be cleaner, and I think this
was discussed on the original discuss thread already.

(4) Nit: For the "timestamp" compaction strategy you changed the KIP to

-> `The record [create] timestamp`

This is misleading IMHO, because what the actual record timestamp is
depends on the broker/log configuration `(log.)message.timestamp.type`,
which can be either `CreateTime` or `LogAppendTime`. I would just remove
"create" to keep it unspecified.
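
(To make the distinction concrete, a consumer can check which timestamp
semantics a record actually carries; a small sketch using the standard
consumer API:)

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.common.record.TimestampType;

    public class TimestampKind {
        // Sketch: the timestamp a "timestamp" compaction strategy would compare
        // is either the producer-supplied CreateTime or the broker-assigned
        // LogAppendTime, depending on (log.)message.timestamp.type.
        public static void describe(ConsumerRecord<byte[], byte[]> record) {
            TimestampType type = record.timestampType(); // CREATE_TIME or LOG_APPEND_TIME
            System.out.println("offset=" + record.offset()
                + " timestamp=" + record.timestamp()
                + " type=" + type);
        }
    }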

(5) Nit: the section "Public Interfaces" should list the newly
introduced configs -- configuration parameters are a public interface.

(6) What do you mean by "first level header lookup"? The term "first
level" indicates some hierarchy, but headers don't have any hierarchy --
it's just a list of key-value pairs? If you mean the _order_ of the
headers, ie, pick the first header in the list that matches the key,
please rephrase it to make it clearer.
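
(If it simply means list order, a small sketch against the existing
Headers API shows the two obvious interpretations; which one the cleaner
should use is exactly what needs to be spelled out:)

    import org.apache.kafka.common.header.Header;
    import org.apache.kafka.common.header.Headers;

    public class HeaderLookup {
        // Sketch: headers are an ordered list and may contain the same key more
        // than once, so "first" and "last" can return different values.
        public static Header firstHeader(Headers headers, String key) {
            for (Header h : headers.headers(key)) { // matches in insertion order
                return h;                           // oldest matching header
            }
            return null;
        }

        public static Header newestHeader(Headers headers, String key) {
            return headers.lastHeader(key);         // newest matching header
        }
    }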



@Tom: I agree with all you are saying, however, I still think that this
KIP will improve the overall situation, because everything you pointed
out is actually true with offset based compaction, too.

The KIP is not a silver bullet that solves all issue for interleaved
writes, but I personally believe, it's a good improvement.



-Matthias


On 10/30/19 9:45 AM, Senthilnathan Muthusamy wrote:
> Hi,
> 
> Please let me know if anyone has any questions on this updated KIP-280...
> 
> Thanks,
> 
> Senthil
> 
> -----Original Message-----
> From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID> 
> Sent: Monday, October 28, 2019 11:36 PM
> To: dev@kafka.apache.org
> Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction
> 
> Hi Tom,
> 
> Sorry for the delayed response.
> 
> Regarding the fallback-to-offset decision for both the timestamp & header value strategies: it is based on the previous author's discussion at https://lists.apache.org/thread.html/f44317eb6cd34f91966654c80509d4a457dbbccdd02b86645782be67@%3Cdev.kafka.apache.org%3E and, as per that discussion, it is really required to avoid duplicates.
> 
> And the timestamp strategy is from the original KIP author and we are keeping it as is.
> 
> Finally, on the sequence order guarantee by the producer: waiting for acks is not feasible in async / multi-threaded / multi-process scenarios, hence the header-sequence-based compaction strategy, with the producer responsible for generating a unique sequence at the topic-partition-key level.
> 
> Hoping this clarifies all your questions. Please let us know if you have any further questions.
> 
> @Guozhang Wang / @Matthias J. Sax, I see you both had a detailed discussion on the original KIP with the previous author and it would be great to hear your inputs as well.
> 
> Thanks,
> Senthil
> 
> -----Original Message-----
> From: Tom Bentley <tb...@redhat.com>
> Sent: Tuesday, October 22, 2019 2:32 AM
> To: dev@kafka.apache.org
> Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> 
> Hi Senthilnathan,
> 
> In the motivation isn't it a little misleading to say "On the producer side, we clearly preserve an order for the two messages, <K1, V1> <K1,
> V2>"? IMHO, the semantics of the producer are clear that having an 
> V2>observed
> order of sending records from different producers is not sufficient to guarantee ordering on the broker. You really need to send the 2nd record only after the 1st record is acked. It's the difficultly of achieving that in practice that's the true motivation for your KIP.
> 
> I can see the attraction of using timestamps, but it would be helpful to explain how that really solves the problem. When the producers are in different processes on different machines you're relying on their clocks being synchronized, which is a whole subject in itself. Even if they're synchronized the resolution of System.currentTimeMillis() is typically many milliseconds. If your producers are in different threads of the same process that could be a real problem because it makes ties quite likely.
> And you don't explain why it's OK to resolve ties using the offset. The basis of your argument is that the offset is giving you the wrong answer.
> So it seems to me that using it as a tiebreaker is just narrowing the chances of getting the wrong answer. Maybe none of this matters for your use case, but I think it should be spelled out in the KIP, because it surely would matter for similar use cases.
> 
> Using a sequence at least removes the problem of ties, but the interesting bit is now in how you deal with races between threads/processes in getting a sequence number allocated (which is out of scope of the KIP, I guess).
> How is resolving that race any simpler than resolving the motivating race by waiting for the ack of the first record sent?
> 
> Kind regards,
> 
> Tom
> 
> On Mon, Oct 21, 2019 at 9:06 PM Senthilnathan Muthusamy <se...@microsoft.com.invalid> wrote:
> 
>> Hi All,
>>
>> We are bring back the KIP-280 to live with small correct for the 
>> discussion & voting. Thanks to previous author Luis Cabral on the
>> KIP-280 initiation and we are taking over to complete and get it into 2.4...
>>
>> Below is the correction that we made to the existing KIP-280:
>>
>>   *   Allowing the compact strategy configuration at the topic level as
>> the log compaction is at the topic level and a broker can have 
>> multiple topics. This allows the flexibility to have the strategy at 
>> both broker level (i.e. for all topics within the broker) and topic 
>> level (i.e. for a subset of topics within a broker) as well...
>>
>> KIP-280:
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction
>> PULL REQUEST: https://github.com/apache/kafka/pull/7528 (unit test coverage in
>> progress)
>>
>> Previous Thread DISCUSS:
>> https://lists.apache.org/thread.html/79aa6e50d7c737ddf83455dd8063692a535a1afa558620fe1a1496d3@%3Cdev.kafka.apache.org%3E
>> Previous Thread VOTE:
>> https://lists.apache.org/thread.html/b2ecd73ce849741f0c40b4f801c3f7650583497812713e240e1ac2b7@%3Cdev.kafka.apache.org%3E
>>
>> Appreciate your timely action.
>>
>> PS: Initiating a separate thread as I was not able to reply to the 
>> existing threads...
>>
>> Thanks,
>> Senthil
>>


RE: [DISCUSS] KIP-280: Enhanced log compaction

Posted by Senthilnathan Muthusamy <se...@microsoft.com.INVALID>.
Hi,

Please let me know if anyone has any questions on this updated KIP-280...

Thanks,

Senthil

-----Original Message-----
From: Senthilnathan Muthusamy <se...@microsoft.com.INVALID> 
Sent: Monday, October 28, 2019 11:36 PM
To: dev@kafka.apache.org
Subject: RE: [DISCUSS] KIP-280: Enhanced log compaction

Hi Tom,

Sorry for the delayed response.

Regarding the fallback-to-offset decision for both the timestamp & header value strategies: it is based on the previous author's discussion at https://lists.apache.org/thread.html/f44317eb6cd34f91966654c80509d4a457dbbccdd02b86645782be67@%3Cdev.kafka.apache.org%3E and, as per that discussion, it is really required to avoid duplicates.

And the timestamp strategy is from the original KIP author and we are keeping it as is.

Finally, on the sequence order guarantee by the producer: waiting for acks is not feasible in async / multi-threaded / multi-process scenarios, hence the header-sequence-based compaction strategy, with the producer responsible for generating a unique sequence at the topic-partition-key level.

Hoping this clarifies all your questions. Please let us know if you have any further questions.

@Guozhang Wang / @Matthias J. Sax, I see you both had a detailed discussion on the original KIP with the previous author and it would be great to hear your inputs as well.

Thanks,
Senthil

-----Original Message-----
From: Tom Bentley <tb...@redhat.com>
Sent: Tuesday, October 22, 2019 2:32 AM
To: dev@kafka.apache.org
Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction

Hi Senthilnathan,

In the motivation isn't it a little misleading to say "On the producer side, we clearly preserve an order for the two messages, <K1, V1> <K1,
V2>"? IMHO, the semantics of the producer are clear that having an 
V2>observed
order of sending records from different producers is not sufficient to guarantee ordering on the broker. You really need to send the 2nd record only after the 1st record is acked. It's the difficultly of achieving that in practice that's the true motivation for your KIP.

I can see the attraction of using timestamps, but it would be helpful to explain how that really solves the problem. When the producers are in different processes on different machines you're relying on their clocks being synchronized, which is a whole subject in itself. Even if they're synchronized the resolution of System.currentTimeMillis() is typically many milliseconds. If your producers are in different threads of the same process that could be a real problem because it makes ties quite likely.
And you don't explain why it's OK to resolve ties using the offset. The basis of your argument is that the offset is giving you the wrong answer.
So it seems to me that using it as a tiebreaker is just narrowing the chances of getting the wrong answer. Maybe none of this matters for your use case, but I think it should be spelled out in the KIP, because it surely would matter for similar use cases.

Using a sequence at least removes the problem of ties, but the interesting bit is now in how you deal with races between threads/processes in getting a sequence number allocated (which is out of scope of the KIP, I guess).
How is resolving that race any simpler than resolving the motivating race by waiting for the ack of the first record sent?

Kind regards,

Tom

On Mon, Oct 21, 2019 at 9:06 PM Senthilnathan Muthusamy <se...@microsoft.com.invalid> wrote:

> Hi All,
>
> We are bring back the KIP-280 to live with small correct for the 
> discussion & voting. Thanks to previous author Luis Cabral on the
> KIP-280 initiation and we are taking over to complete and get it into 2.4...
>
> Below is the correction that we made to the existing KIP-280:
>
>   *   Allowing the compact strategy configuration at the topic level as
> the log compaction is at the topic level and a broker can have 
> multiple topics. This allows the flexibility to have the strategy at 
> both broker level (i.e. for all topics within the broker) and topic 
> level (i.e. for a subset of topics within a broker) as well...
>
> KIP-280:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction
> PULL REQUEST: https://github.com/apache/kafka/pull/7528 (unit test coverage in
> progress)
>
> Previous Thread DISCUSS:
> https://lists.apache.org/thread.html/79aa6e50d7c737ddf83455dd8063692a535a1afa558620fe1a1496d3@%3Cdev.kafka.apache.org%3E
> Previous Thread VOTE:
> https://lists.apache.org/thread.html/b2ecd73ce849741f0c40b4f801c3f7650583497812713e240e1ac2b7@%3Cdev.kafka.apache.org%3E
>
> Appreciate your timely action.
>
> PS: Initiating a separate thread as I was not able to reply to the 
> existing threads...
>
> Thanks,
> Senthil
>

RE: [DISCUSS] KIP-280: Enhanced log compaction

Posted by Senthilnathan Muthusamy <se...@microsoft.com.INVALID>.
Hi Tom,

Sorry for the delayed response.

Regarding the fallback-to-offset decision for both the timestamp & header value strategies: it is based on the previous author's discussion at https://lists.apache.org/thread.html/f44317eb6cd34f91966654c80509d4a457dbbccdd02b86645782be67@%3Cdev.kafka.apache.org%3E and, as per that discussion, it is really required to avoid duplicates.
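
Roughly, the retention rule being proposed can be sketched like this (illustration only, not the actual cleaner code; "version" stands for the configured header value or the record timestamp, depending on the strategy):

    public class CompactionDecision {
        // Sketch of the proposed rule: a newer record for the same key replaces
        // the retained one only if its compaction version is greater; on a tie
        // we fall back to the offset, which preserves today's offset-based
        // behavior and avoids keeping duplicates.
        public static boolean shouldReplace(long retainedVersion, long retainedOffset,
                                            long candidateVersion, long candidateOffset) {
            if (candidateVersion != retainedVersion) {
                return candidateVersion > retainedVersion;
            }
            return candidateOffset > retainedOffset; // tie-break on offset
        }
    }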

And the timestamp strategy is from the original KIP author and we are keeping it as is.

Finally, on the sequence order guarantee by the producer: waiting for acks is not feasible in async / multi-threaded / multi-process scenarios, hence the header-sequence-based compaction strategy, with the producer responsible for generating a unique sequence at the topic-partition-key level.
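
For example, a producer meeting that responsibility might attach a strictly increasing sequence header per key, roughly as sketched below (the header name "sequence" and the helper nextSequenceFor() are illustrative only and would have to match whatever the strategy is configured with):

    import java.nio.ByteBuffer;
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class SequencedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                long sequence = nextSequenceFor("K1"); // application-managed, unique per topic-partition-key
                ProducerRecord<String, String> record =
                    new ProducerRecord<>("my-topic", "K1", "V2");
                record.headers().add("sequence",
                    ByteBuffer.allocate(Long.BYTES).putLong(sequence).array());
                producer.send(record);
            }
        }

        // Hypothetical helper: the KIP leaves sequence allocation to the application.
        private static long nextSequenceFor(String key) {
            return System.nanoTime(); // placeholder; a real allocator must be strictly increasing per key
        }
    }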

Hoping this clarifies all your questions. Please let us know if you have any further questions.

@Guozhang Wang / @Matthias J. Sax, I see you both had a detailed discussion on the original KIP with the previous author and it would be great to hear your inputs as well.

Thanks,
Senthil

-----Original Message-----
From: Tom Bentley <tb...@redhat.com> 
Sent: Tuesday, October 22, 2019 2:32 AM
To: dev@kafka.apache.org
Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction

Hi Senthilnathan,

In the motivation isn't it a little misleading to say "On the producer side, we clearly preserve an order for the two messages, <K1, V1> <K1,
V2>"? IMHO, the semantics of the producer are clear that having an 
V2>observed
order of sending records from different producers is not sufficient to guarantee ordering on the broker. You really need to send the 2nd record only after the 1st record is acked. It's the difficultly of achieving that in practice that's the true motivation for your KIP.

I can see the attraction of using timestamps, but it would be helpful to explain how that really solves the problem. When the producers are in different processes on different machines you're relying on their clocks being synchronized, which is a whole subject in itself. Even if they're synchronized the resolution of System.currentTimeMillis() is typically many milliseconds. If your producers are in different threads of the same process that could be a real problem because it makes ties quite likely.
And you don't explain why it's OK to resolve ties using the offset. The basis of your argument is that the offset is giving you the wrong answer.
So it seems to me that using it as a tiebreaker is just narrowing the chances of getting the wrong answer. Maybe none of this matters for your use case, but I think it should be spelled out in the KIP, because it surely would matter for similar use cases.

Using a sequence at least removes the problem of ties, but the interesting bit is now in how you deal with races between threads/processes in getting a sequence number allocated (which is out of scope of the KIP, I guess).
How is resolving that race any simpler than resolving the motivating race by waiting for the ack of the first record sent?

Kind regards,

Tom

On Mon, Oct 21, 2019 at 9:06 PM Senthilnathan Muthusamy <se...@microsoft.com.invalid> wrote:

> Hi All,
>
> We are bring back the KIP-280 to live with small correct for the 
> discussion & voting. Thanks to previous author Luis Cabral on the 
> KIP-280 initiation and we are taking over to complete and get it into 2.4...
>
> Below is the correction that we made to the existing KIP-280:
>
>   *   Allowing the compact strategy configuration at the topic level as
> the log compaction is at the topic level and a broker can have 
> multiple topics. This allows the flexibility to have the strategy at 
> both broker level (i.e. for all topics within the broker) and topic 
> level (i.e. for a subset of topics within a broker) as well...
>
> KIP-280:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction
> PULL REQUEST: https://github.com/apache/kafka/pull/7528 (unit test coverage in
> progress)
>
> Previous Thread DISCUSS:
> https://lists.apache.org/thread.html/79aa6e50d7c737ddf83455dd8063692a535a1afa558620fe1a1496d3@%3Cdev.kafka.apache.org%3E
> Previous Thread VOTE:
> https://lists.apache.org/thread.html/b2ecd73ce849741f0c40b4f801c3f7650583497812713e240e1ac2b7@%3Cdev.kafka.apache.org%3E
>
> Appreciate your timely action.
>
> PS: Initiating a separate thread as I was not able to reply to the 
> existing threads...
>
> Thanks,
> Senthil
>

Re: [DISCUSS] KIP-280: Enhanced log compaction

Posted by Tom Bentley <tb...@redhat.com>.
Hi Senthilnathan,

In the motivation isn't it a little misleading to say "On the producer
side, we clearly preserve an order for the two messages, <K1, V1> <K1,
V2>"? IMHO, the semantics of the producer are clear that having an observed
order of sending records from different producers is not sufficient to
guarantee ordering on the broker. You really need to send the 2nd record
only after the 1st record is acked. It's the difficulty of achieving that
in practice that's the true motivation for your KIP.

I can see the attraction of using timestamps, but it would be helpful to
explain how that really solves the problem. When the producers are in
different processes on different machines you're relying on their clocks
being synchronized, which is a whole subject in itself. Even if they're
synchronized the resolution of System.currentTimeMillis() is typically many
milliseconds. If your producers are in different threads of the same
process that could be a real problem because it makes ties quite likely.
And you don't explain why it's OK to resolve ties using the offset. The
basis of your argument is that the offset is giving you the wrong answer.
So it seems to me that using it as a tiebreaker is just narrowing the
chances of getting the wrong answer. Maybe none of this matters for your
use case, but I think it should be spelled out in the KIP, because it
surely would matter for similar use cases.

Using a sequence at least removes the problem of ties, but the interesting
bit is now in how you deal with races between threads/processes in getting
a sequence number allocated (which is out of scope of the KIP, I guess).
How is resolving that race any simpler than resolving the motivating race
by waiting for the ack of the first record sent?

Kind regards,

Tom

On Mon, Oct 21, 2019 at 9:06 PM Senthilnathan Muthusamy
<se...@microsoft.com.invalid> wrote:

> Hi All,
>
> We are bring back the KIP-280 to live with small correct for the
> discussion & voting. Thanks to previous author Luis Cabral on the KIP-280
> initiation and we are taking over to complete and get it into 2.4...
>
> Below is the correction that we made to the existing KIP-280:
>
>   *   Allowing the compact strategy configuration at the topic level as
> the log compaction is at the topic level and a broker can have multiple
> topics. This allows the flexibility to have the strategy at both broker
> level (i.e. for all topics within the broker) and topic level (i.e. for a
> subset of topics within a broker) as well...
>
> KIP-280:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction
> PULL REQUEST: https://github.com/apache/kafka/pull/7528 (unit test
> coverage in progress)
>
> Previous Thread DISCUSS:
> https://lists.apache.org/thread.html/79aa6e50d7c737ddf83455dd8063692a535a1afa558620fe1a1496d3@%3Cdev.kafka.apache.org%3E
> Previous Thread VOTE:
> https://lists.apache.org/thread.html/b2ecd73ce849741f0c40b4f801c3f7650583497812713e240e1ac2b7@%3Cdev.kafka.apache.org%3E
>
> Appreciate your timely action.
>
> PS: Initiating a separate thread as I was not able to reply to the
> existing threads...
>
> Thanks,
> Senthil
>