Posted to dev@kafka.apache.org by Guozhang Wang <wa...@gmail.com> on 2018/09/01 18:32:50 UTC

Re: [DISCUSS] KIP-280: Enhanced log compaction

Hello Luis,

Thanks for your thoughtful responses, here are my two cents:

1) I think having the new configs with per-topic granularity would not
introduce much memory overhead or logic complexity, as all you need to do is
remember this in the topic metadata cache. If I've missed some details
about the complexity, could you elaborate a bit more?
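To make the suggested cost concrete, here is a minimal sketch (class and
method names are hypothetical, not Kafka code) of a per-topic override map
consulted before the broker-wide default: memory grows only with the number
of topics that actually override the strategy.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Kafka code: per-topic compaction-strategy
// overrides kept alongside the topic metadata, falling back to the
// broker-wide default when a topic has no override.
class CompactionStrategyCache {
    private final String brokerDefault;
    private final Map<String, String> topicOverrides = new HashMap<>();

    CompactionStrategyCache(String brokerDefault) {
        this.brokerDefault = brokerDefault;
    }

    void putOverride(String topic, String strategy) {
        topicOverrides.put(topic, strategy);
    }

    String strategyFor(String topic) {
        // One map lookup per topic; no per-record state is added.
        return topicOverrides.getOrDefault(topic, brokerDefault);
    }
}
```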

2) I agree with you: the current `ConfigDef.Validator` only has scope over
the validated config value itself, and hence cannot depend on another
config.
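For reference, the shape of the callback is roughly the following (a
simplified mirror of the `ConfigDef.Validator` idea, not the actual Kafka
class): it receives only the single name/value pair being validated, so a
rule like "`compaction.strategy.header` is required when the strategy is
`header`" has no sibling config to inspect.

```java
// Simplified mirror of the validator shape (illustrative, not the actual
// Kafka class): the callback sees one config at a time, so it cannot
// express cross-config rules such as "Y is required when X has value V".
interface Validator {
    void ensureValid(String name, Object value); // throws if invalid
}

class NonEmptyString implements Validator {
    @Override
    public void ensureValid(String name, Object value) {
        // Only `value` is in scope here; no other config is reachable.
        if (!(value instanceof String) || ((String) value).isEmpty()) {
            throw new IllegalArgumentException(name + " must be a non-empty string");
        }
    }
}
```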

4) I think Jun's point is that since we need the latest message in the log
segment for timestamp tracking, we cannot actually delete it: with the
offset-only policy this is naturally guaranteed, but with the other policies
the latest message is no longer guaranteed to survive compaction, and hence
we need to special-case it and not delete it.
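A toy model of that special case (hypothetical names, not the actual cleaner
code): compact by a per-key "version" header, but unconditionally retain the
record at the highest offset, even if a higher version exists for its key,
so the segment's latest timestamp entry survives.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model, not Kafka's cleaner: keep the highest-"version" record per
// key, but always retain the final record of the segment so timestamp
// tracking keeps its latest entry.
record Rec(long offset, String key, long version) {}

class HeaderCompactor {
    static List<Rec> compact(List<Rec> segment) {
        Map<String, Rec> winner = new HashMap<>();
        for (Rec r : segment) {
            // Later records win ties, matching the offset order of the log.
            winner.merge(r.key(), r, (a, b) -> b.version() >= a.version() ? b : a);
        }
        Rec last = segment.get(segment.size() - 1);
        List<Rec> kept = new ArrayList<>();
        for (Rec r : segment) {
            // Keep per-key winners, plus the final record unconditionally.
            if (winner.get(r.key()) == r || r == last) {
                kept.add(r);
            }
        }
        return kept;
    }
}
```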



Guozhang


On Wed, Aug 29, 2018 at 9:25 AM, Luís Cabral <lu...@yahoo.com.invalid>
wrote:

> Hi all,
>
> Since there has been a rejuvenated interest in this KIP, it felt better to
> downgrade it back to [DISCUSSION], as we aren't really voting on it
> anymore.
> I'll try to address the currently pending questions on the following
> points, so please bear with me while we go through them all:
>
> 1) Configuration: Topic vs Global
>
> Here we all agree that having a configuration per-topic is the best
> option. However, this is not possible with the current design of the
> compaction solution. Yes, it is true that "some" properties that relate to
> compaction are configurable per-topic, but those are only the properties
> that act outside(!) of the actual compaction logic, such as configuring the
> start-compaction trigger with "ratio" or the compaction duration with
> "lag.ms".
> This logic can, of course, be re-designed to suit our wishes, but this is
> not a trivial effort, and if we have spent months arguing about the extra 8
> bytes per record, for sure we would spend another few dozen months
> discussing the memory impact that this change to the properties would
> invariably have.
> As such, I will limit this KIP to ONLY have these configurations globally.
>
> 2) Configuration: Fail-fast vs Fallback
>
>
> Ideally, I would also like to prevent the application from starting if the
> configuration is somehow invalid.
> However (another 'however'), the way the configuration is handled prevents
> adding dependencies between them, so we can't add logic that says
> "configuration X is invalid if configuration Y is so-and-such".
> Again, this can be re-designed to add this feature to the configuration
> logic, but it would again be a big change just by itself, so this KIP is
> again limited to using ONLY what is already in place.
>
> 3) Documenting the memory impact on the KIP
>
> This is now added to the KIP, though this topic is more complicated than
> 'memory impact'. E.g.: this change doesn't translate to an actual memory
> impact, it just means that the compaction will potentially encompass fewer
> records per execution.
>
> 4) Documenting how we deal with the last message in the log
>
> I have 2 interpretations for this request: "the last message in the log"
> or "the last message with a shared key in the log".
> For the former: there is no change to the logic on how the last message is
> handled. Only the "tail" gets compacted, so the "head" (which includes the
> last message) is unaffected.
>
> 5) Documenting how the key deletion will be handled
>
> I'm having some trouble understanding this one; do you mean how keys are
> deleted in general, or?
>
> Cheers,
> Luis Cabral
>
>    On Friday, August 24, 2018, 1:54:54 AM GMT+2, Jun Rao <ju...@confluent.io>
> wrote:
>
>  Hi, Luis,
>
> Thanks for the reply. A few more comments below.
>
> 1. About the topic level configuration. It seems that it's useful for the
> new configs to be at the topic level. Currently, the following configs
> related to compaction are already at the topic level.
>
> min.cleanable.dirty.ratio
> min.compaction.lag.ms
> cleanup.policy
>
> 2. Have you documented the memory impact in the KIP?
>
> 3. Could you document how we deal with the last message in the log, which
> is potentially cleanable now?
>
> 4. Could you document how key deletion will be handled?
>
> 10. As for Jason's proposal on CompactionStrategy, it does make the feature
> more general. On the other hand, it will be useful not to require
> user-level code if the compaction value only comes from the header.
>
> 20. "If compaction.strategy is set to header and compaction.strategy.header
> is not set, the KIP falls back to offset." I am wondering if it's better to
> just fail the configuration in that case.
>
> Jun
>
>
>
> On Thu, Aug 16, 2018 at 1:33 PM, Guozhang Wang <wa...@gmail.com> wrote:
>
> > Regarding "broker-agnostic of headers": there are some KIPs from Streams
> > that use headers for internal purposes as well, e.g. KIP-258 and KIP-213
> > (I admit there may be a conflict with user space, but practically I think
> > it is very rare). So I think we are very likely going to make Kafka
> > internals "headers-aware" anyways.
> >
> > Regarding the general API: I think it is a good idea in general, but it
> > may still have limits: note that right now our KIP enforces the header
> > type to be long, and we had a very careful discussion about the fall-back
> > policy if a header does not have the specified key or if the value is not
> > long-typed; but if we enforce the long type in the interface, it would
> > require users trying to customize their compaction logic (think: based on
> > some value payload field) to transform their fields to long as well. So I
> > feel we can still go with the currently proposed approach, and only
> > consider this general API if we observe a general usage requirement. By
> > that time we can still extend the config values of
> > "log.cleaner.compaction.strategy" to "offset, timestamp, header,
> > myFuncName".
> >
> > @Bertus
> >
> > Thanks for your feedback, I believe the proposed config is indeed for
> > both global (for the whole broker) and per-topic; Luís can confirm if
> > this is the case, and update the wiki page to make it clear.
> >
> >
> > Guozhang
> >
> >
> > On Thu, Aug 16, 2018 at 9:09 AM, Bertus Greeff <
> > bgreeff@microsoft.com.invalid> wrote:
> >
> > > I'm interested to know the status of this KIP.  I see that the status
> > > is "Voting".  How long does this normally take?
> > >
> > > We want to use Kafka and this KIP provides exactly the log compaction
> > > logic that we want for many of our projects.
> > >
> > > One piece of feedback that I have is that log.cleaner.compaction.strategy
> > > and log.cleaner.compaction.strategy.header need to be per topic.  The
> > > text of the KIP makes it sound that the config is only available for all
> > > topics, but this makes no sense.  Different topics will need different
> > > strategies and/or headers.
> > >
> > > From the KIP:
> > > Provide the configuration for the individual topics
> > > None of the configurations for log compaction are available at topic
> > > level, so adding it there is not a part of this KIP
> > >
> > >
> > >
> > > On 2018/04/05 08:44:00, Luís Cabral <l....@yahoo.com.INVALID> wrote:
> > > > Hello all,
> > > > Starting a discussion for this feature.
> > > > KIP-280: https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction
> > > > Pull-4822: https://github.com/apache/kafka/pull/4822
> > > >
> > > > Kind Regards,
> > > > Luís
> > >
> >
> >
> >
> > --
> > -- Guozhang
> >
>



-- 
-- Guozhang

Re: [DISCUSS] KIP-280: Enhanced log compaction

Posted by Luís Cabral <lu...@yahoo.com.INVALID>.
 Since this is not moving forward, how about I proceed with the currently documented changes, and any improvements (such as configuration changes) can be taken up afterwards by whoever wants it under a different KIP?

Re: [DISCUSS] KIP-280: Enhanced log compaction

Posted by Luís Cabral <lu...@yahoo.com.INVALID>.
 Hi Matthias,

How can this be done?

Kind Regards,
Luis
 

Re: [DISCUSS] KIP-280: Enhanced log compaction

Posted by "Matthias J. Sax" <ma...@confluent.io>.
Luis,

What is the status of this KIP?

I tend to agree, that introducing the feature only globally, might be
less useful (I would assume that people want to turn it on, on a
per-topic basis). As I am not familiar with the corresponding code, I
cannot judge the complexity to add topic level configs, however, it
seems to be worth to include it in the KIP.


-Matthias



On 9/21/18 1:59 PM, Bertus Greeff wrote:
> Someone pointed out to me that my scenario is also resolved by using Kafka transactions.  Zombie fencing which is essentially my scenario was one of the scenarios that transactions were designed to solve.  I was going to use the ideas of this KIP to solve it but using transactions seems even better because out of order messages never even make it into the topic.  They are blocked by the broker.
> 
> -----Original Message-----
> From: Guozhang Wang <wa...@gmail.com> 
> Sent: Saturday, September 1, 2018 11:33 AM
> To: dev <de...@kafka.apache.org>
> Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> 
> Hello Luis,
> 
> Thanks for your thoughtful responses, here are my two cents:
> 
> 1) I think having the new configs with per-topic granularity would not introduce much memory overhead or logic complexity, as all you need is to remember this at the topic metadata cache. If I've missed some details about the complexity, could you elaborate a bit more?
> 
> 2) I agree with you: the current `ConfigDef.Validator` only scope on the validated config value itself, and hence cannot be dependent on another config.
> 
> 4) I think Jun's point is that since we need the latest message in the log segment for the timestamp tracking, we cannot delete it actually: with offset based only policy, this is naturally guaranteed; but now with other policy, it is not guaranteed to never be compacted away, and hence we need to "enforce" to special-handle this case and not delete it.
> 
> 
> 
> Guozhang
> 
> 
> 
> Guozhang
> 
> 
> On Wed, Aug 29, 2018 at 9:25 AM, Luís Cabral <lu...@yahoo.com.invalid>
> wrote:
> 
>> Hi all,
>>
>> Since there has been a rejuvenated interest in this KIP, it felt 
>> better to downgrade it back down to [DISCUSSION], as we aren't really 
>> voting on it anymore.
>> I'll try to address the currently pending questions on the following 
>> points, so please bear with me while we go through them all:
>>
>> 1) Configuration: Topic vs Global
>>
>> Here we all agree that having a configuration per-topic is the best 
>> option. However, this is not possible with the current design of the 
>> compaction solution. Yes, it is true that "some" properties that 
>> relate to compaction are configurable per-topic, but those are only 
>> the properties that act outside(!) of the actual compaction logic, 
>> such as configuring the start-compaction trigger with "ratio" or 
>> compaction duration with "lag.ms ".
>> This logic can, of course, be re-designed to suit our wishes, but this 
>> is not a direct effort, and if we have spent months arguing about the 
>> extra 8 bytes per record, for sure we would spend another few dozen 
>> months discussing the memory impact that doing this change to the 
>> properties will invariably have.
>> As such, I will limit this KIP to ONLY have these configurations globally.
>>
>> 2) Configuration: Fail-fast vs Fallback
>>
>>
>> Ideally, I would also like to prevent the application to start if the 
>> configuration is somehow invalid.
>> However (another 'however'), the way the configuration is handled 
>> prevents adding dependencies between them, so we can't add logic that 
>> says "configuration X is invalid if configuration Y is so-and-such".
>> Again, this can be re-designed to add this feature to the 
>> configuration logic, but it would again be a big change just by 
>> itself, so this KIP is again limited to use ONLY what is already in place.
>>
>> 3) Documenting the memory impact on the KIP
>>
>> This is now added to the KIP, though this topic is more complicated
>> than just 'memory impact'. E.g.: this change doesn't translate to an
>> actual memory impact; it just means that each compaction run will
>> potentially encompass fewer records.
>>
>> 4) Documenting how we deal with the last message in the log
>>
>> I have 2 interpretations for this request: "the last message in the log"
>> or "the last message with a shared key in the log".
>> For the former: there is no change to the logic on how the last
>> message is handled. Only the "tail" gets compacted, so the "head"
>> (which includes the last message) is never touched by compaction.
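
The header-based strategy under discussion can be sketched as a retention rule over the compacted portion of the log: per key, keep the record with the highest long value in the configured header, falling back to offset when the header is missing or the values tie. This is a simplification of the proposal for illustration, not the actual cleaner code:

```python
def compact(records, header_key="version"):
    """Sketch of KIP-280-style compaction over a list of records.

    Each record is a dict: {"offset": int, "key": str, "headers": {str: int}}.
    Per key, the record with the highest header value wins; records
    without the header (or with equal header values) fall back to offset.
    """
    survivors = {}
    for rec in records:
        best = survivors.get(rec["key"])
        if best is None or _wins(rec, best, header_key):
            survivors[rec["key"]] = rec
    return sorted(survivors.values(), key=lambda r: r["offset"])


def _wins(candidate, incumbent, header_key):
    c = candidate["headers"].get(header_key)
    i = incumbent["headers"].get(header_key)
    if c is not None and i is not None and c != i:
        return c > i  # header value decides
    return candidate["offset"] > incumbent["offset"]  # offset fallback


records = [
    {"offset": 0, "key": "a", "headers": {"version": 2}},
    {"offset": 1, "key": "a", "headers": {"version": 1}},  # out of order
    {"offset": 2, "key": "b", "headers": {}},              # no header
]
# Key "a" keeps offset 0 (version 2 beats the later version 1), which is
# exactly what the offset-only strategy cannot do; key "b" falls back to
# plain offset-based behavior.
```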
>>
>> 5) Documenting how the key deletion will be handled
>>
>> I'm having some trouble understanding this one; do you mean how keys 
>> are deleted in general, or?
>>
>> Cheers,
>> Luis Cabral
>>
>>    On Friday, August 24, 2018, 1:54:54 AM GMT+2, Jun Rao 
>> <ju...@confluent.io>
>> wrote:
>>
>>  Hi, Luis,
>>
>> Thanks for the reply. A few more comments below.
>>
>> 1. About the topic level configuration. It seems that it's useful for 
>> the new configs to be at the topic level. Currently, the following 
>> configs related to compaction are already at the topic level.
>>
>> min.cleanable.dirty.ratio
>> min.compaction.lag.ms
>> cleanup.policy
>>
>> 2. Have you documented the memory impact in the KIP?
>>
>> 3. Could you document how we deal with the last message in the log, 
>> which is potentially cleanable now?
>>
>> 4. Could you document how key deletion will be handled?
>>
>> 10. As for Jason's proposal on CompactionStrategy, it does make the 
>> feature more general. On the other hand, it will be useful not to 
>> require user-level code if the compaction value only comes from the header.
>>
>> 20. "If compaction.strategy is set to 'header' and
>> compaction.strategy.header is not set, the KIP falls back to offset."
>> I am wondering if it's better to just fail the configuration in that case.
>>
>> Jun
>>
>>
>>
>> On Thu, Aug 16, 2018 at 1:33 PM, Guozhang Wang <wa...@gmail.com> wrote:
>>
>>> Regarding "broker-agnostic of headers": there are some KIPs from
>>> Streams to use headers for internal purposes as well, e.g. KIP-258
>>> and KIP-213 (I admit there may be a conflict with user space, but
>>> practically I think it is very rare). So I think we are very likely
>>> going to make Kafka internals "headers-aware" anyways.
>>>
>>> Regarding the general API: I think it is a good idea in general, but
>>> it may still have limits: note that right now our KIP enforces the
>>> header type to be long, and we have had a very careful discussion
>>> about the fall-back policy if the header does not have the specified
>>> key or if the value is not long-typed; but if we enforce the long
>>> type in the interface, it would require users trying to customize
>>> their compaction logic (think: based on some value payload field) to
>>> transform their fields to long as well. So I feel we can still go
>>> with the current proposed approach, and only consider this general
>>> API if we observe it does have a general usage requirement. By that
>>> time we can still extend the config values of
>>> "log.cleaner.compaction.strategy" to "offset, timestamp, header,
>>> myFuncName".
>>>
>>> @Bertus
>>>
>>> Thanks for your feedback, I believe the proposed config is indeed 
>>> for
>> both
>>> global (for the whole broker) and per-topic, Luís can confirm if 
>>> this is the case, and update the wiki page to make it clear.
>>>
>>>
>>> Guozhang
>>>
>>>
>>> On Thu, Aug 16, 2018 at 9:09 AM, Bertus Greeff < 
>>> bgreeff@microsoft.com.invalid> wrote:
>>>
>>>> I'm interested to know the status of this KIP.  I see that the
>>>> status is "Voting".  How long does this normally take?
>>>>
>>>> We want to use Kafka and this KIP provides exactly the log 
>>>> compaction logic that we want for many of our projects.
>>>>
>>>> One piece of feedback that I have is that
>>>> log.cleaner.compaction.strategy and
>>>> log.cleaner.compaction.strategy.header need to be per topic.
>>>> The text of the KIP makes it sound that the config is only available
>>>> for all topics, but this makes no sense.  Different topics will need
>>>> different strategies and/or headers.
>>>>
>>>> From the KIP:
>>>> "Provide the configuration for the individual topics: None of the
>>>> configurations for log compaction are available at topic level, so
>>>> adding it there is not a part of this KIP."
>>>>
>>>>
>>>>
>>>> On 2018/04/05 08:44:00, Luís Cabral <l....@yahoo.com.INVALID> wrote:
>>>>> Hello all,
>>>>> Starting a discussion for this feature.
>>>>> KIP-280: https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction
>>>>> Pull-4822: https://github.com/apache/kafka/pull/4822
>>>>
>>>>> Kind Regards,
>>>>> Luís
>>>>
>>>
>>>
>>>
>>> --
>>> -- Guozhang
>>>
>>
> 
> 
> 
> --
> -- Guozhang
> 


RE: [DISCUSS] KIP-280: Enhanced log compaction

Posted by Bertus Greeff <bg...@microsoft.com.INVALID>.
Someone pointed out to me that my scenario is also resolved by using Kafka transactions.  Zombie fencing, which is essentially my scenario, was one of the scenarios that transactions were designed to solve.  I was going to use the ideas of this KIP to solve it, but using transactions seems even better, because out-of-order messages never even make it into the topic: they are blocked by the broker.

-----Original Message-----
From: Guozhang Wang <wa...@gmail.com> 
Sent: Saturday, September 1, 2018 11:33 AM
To: dev <de...@kafka.apache.org>
Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
