Posted to dev@kafka.apache.org by Brett Rann <br...@zendesk.com.INVALID> on 2018/09/03 01:27:39 UTC

Re: [DISCUSS] KIP-354 Time-based log compaction policy

+1 (non-binding) from me on the interface. I'd like to see someone familiar
with the code comment on the approach, and note there are a couple of
different approaches: what's documented in the KIP, and what Xiaohe Dong was
working on here:
https://github.com/dongxiaohe/kafka/tree/dongxiaohe/log-cleaner-compaction-max-lifetime-2.0

If you have code working already, Xiongqi Wu, could you share a PR? I'd be
happy to start testing.

On Tue, Aug 28, 2018 at 5:57 AM xiongqi wu <xi...@gmail.com> wrote:

> Hi All,
>
> Do you have any additional comments on this KIP?
>
>
> On Thu, Aug 16, 2018 at 9:17 PM, xiongqi wu <xi...@gmail.com> wrote:
>
> > > on 2)
> > > The offset map is built starting from the first dirty segment.
> > > The compaction starts from the beginning of the log partition; that's how
> > > it ensures the deletion of tombstoned keys.
> > > I will double check tomorrow.
> >
> > Xiongqi (Wesley) Wu
> >
> >
> > On Thu, Aug 16, 2018 at 6:46 PM Brett Rann <br...@zendesk.com.invalid>
> > wrote:
> >
> >> To just clarify a bit on 1: whether there's an external storage/DB isn't
> >> relevant here. Compacted topics allow a tombstone record to be sent (a
> >> null value for a key), which currently will result in old values for that
> >> key being deleted if some conditions are met. There are existing controls
> >> to make sure the old values will stay around for a minimum time at least,
> >> but no dedicated control to ensure the tombstone will delete within a
> >> maximum time.
> >>
> >> One popular reason that a maximum time for deletion is desirable right
> >> now is GDPR with PII. But we're not proposing any GDPR awareness in
> >> Kafka, just being able to guarantee a max time within which a tombstoned
> >> key will be removed from the compacted topic.
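The tombstone semantics described above can be sketched with a toy model of compaction (illustrative only, not Kafka's cleaner code): the latest record per key wins, and a record with a null value eventually removes the key entirely.

```python
def compact(records):
    """Toy log compaction: keep only the latest value per key, then drop
    keys whose latest value is a tombstone (None)."""
    latest = {}
    for key, value in records:  # records arrive in offset order
        latest[key] = value
    return {k: v for k, v in latest.items() if v is not None}

log = [("user1", "a@example.com"), ("user2", "b@example.com"), ("user1", None)]
print(compact(log))  # {'user2': 'b@example.com'} -- user1's data is gone
```

The KIP's concern is exactly the gap between the tombstone being written and this "drop" actually being executed by the cleaner.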
> >>
> >> on 2)
> >> Huh, I thought it kept track of the first dirty segment and didn't
> >> recompact older "clean" ones. But I didn't look at the code or test for
> >> that.
> >>
> >> On Fri, Aug 17, 2018 at 10:57 AM xiongqi wu <xi...@gmail.com>
> wrote:
> >>
> >> > 1. The owner of the data (in this sense, Kafka is not the owner of the
> >> > data) should keep track of the lifecycle of the data in some external
> >> > storage/DB. The owner determines when to delete the data and sends the
> >> > delete request to Kafka. Kafka doesn't know about the content of the
> >> > data; it only provides a means for deletion.
> >> >
> >> > 2. Each time compaction runs, it starts from the first segment (no
> >> > matter whether it is compacted or not). The time estimation here is
> >> > only used to determine whether we should run compaction on this log
> >> > partition, so we only need to estimate uncompacted segments.
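The decision described here — the timestamp estimate only determines whether a partition should be compacted at all — can be sketched as follows (names and values are illustrative, not the KIP's actual code):

```python
DAY_MS = 24 * 60 * 60 * 1000

def needs_compaction(earliest_uncompacted_ts_ms, now_ms, max_compaction_lag_ms):
    """A partition becomes due for compaction once its earliest un-compacted
    record is estimated to be older than max.compaction.lag.ms."""
    return now_ms - earliest_uncompacted_ts_ms > max_compaction_lag_ms

print(needs_compaction(0, 8 * DAY_MS, 7 * DAY_MS))  # True: 8 days old, 7-day lag
print(needs_compaction(0, 6 * DAY_MS, 7 * DAY_MS))  # False: still within the lag
```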
> >> >
> >> > On Thu, Aug 16, 2018 at 5:35 PM, Dong Lin <li...@gmail.com>
> wrote:
> >> >
> >> > > Hey Xiongqi,
> >> > >
> >> > > Thanks for the update. I have two questions for the latest KIP.
> >> > >
> >> > > 1) The motivation section says that one use case is to delete PII
> >> > > (Personally Identifiable Information) data within 7 days while keeping
> >> > > non-PII indefinitely in compacted format. I suppose the use case
> >> > > depends on the application to determine when to delete that PII data.
> >> > > Could you explain how an application can reliably determine the set of
> >> > > keys that should be deleted? Is the application required to re-read
> >> > > messages from the topic after every restart and determine the keys to
> >> > > be deleted by looking at message timestamps, or is the application
> >> > > supposed to persist the key -> timestamp information in a separate
> >> > > persistent storage system?
> >> > >
> >> > > 2) It is mentioned in the KIP that "we only need to estimate earliest
> >> > > message timestamp for un-compacted log segments because the deletion
> >> > > requests that belong to compacted segments have already been
> >> > > processed". Not sure if that is correct. If a segment is compacted
> >> > > before a user sends a message to delete a key in this segment, it
> >> > > seems that we still need to ensure that the segment will be compacted
> >> > > again within the given time after the deletion is requested, right?
> >> > >
> >> > > Thanks,
> >> > > Dong
> >> > >
> >> > > On Thu, Aug 16, 2018 at 10:27 AM, xiongqi wu <xi...@gmail.com>
> >> > wrote:
> >> > >
> >> > > > Hi Xiaohe,
> >> > > >
> >> > > > Quick note:
> >> > > > 1) Use the minimum of segment.ms and max.compaction.lag.ms
> >> > > >
> >> > > > 2) I am not sure I get your second question. First, we have jitter
> >> > > > when we roll the active segment. Second, on each compaction we
> >> > > > compact up to what the offset map allows. Those two together will
> >> > > > not lead to a compaction storm over time. In addition, I expect
> >> > > > max.compaction.lag.ms to be set on the order of days.
> >> > > >
> >> > > > 3) I don't have access to the Confluent community Slack for now; I
> >> > > > am reachable via Google Hangouts.
> >> > > > To avoid double effort, here is my plan:
> >> > > > a) Collect more feedback and feature requirements on the KIP.
> >> > > > b) Wait until this KIP is approved.
> >> > > > c) Address any additional requirements in the implementation. (My
> >> > > > current implementation only complies with what is described in the
> >> > > > KIP now.)
> >> > > > d) Share the code with you and the community to see if you want to
> >> > > > add anything.
> >> > > > e) Submit through a committer.
> >> > > >
> >> > > >
> >> > > > On Wed, Aug 15, 2018 at 11:42 PM, XIAOHE DONG <
> >> dannyrivclo@gmail.com>
> >> > > > wrote:
> >> > > >
> >> > > > > Hi Xiongqi
> >> > > > >
> >> > > > > Thanks for thinking about implementing this as well. :)
> >> > > > >
> >> > > > > I was thinking about using `segment.ms` to trigger the segment
> >> > > > > roll. Also, its value can be the largest time bias for record
> >> > > > > deletion. For example, if `segment.ms` is 1 day and
> >> > > > > `max.compaction.ms` is 30 days, the compaction may happen around
> >> > > > > day 31.
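The 31-day figure is just the sum of the two settings, ignoring cleaner scheduling delay (a back-of-the-envelope check, not Kafka code):

```python
MS_PER_DAY = 24 * 60 * 60 * 1000
segment_ms = 1 * MS_PER_DAY               # segment.ms: active segment rolls within 1 day
max_compaction_lag_ms = 30 * MS_PER_DAY   # max compaction lag: 30 days

# A record written at the very start of a fresh active segment waits up to
# segment.ms for the roll, then up to the max lag before compaction is forced.
worst_case_days = (segment_ms + max_compaction_lag_ms) / MS_PER_DAY
print(worst_case_days)  # 31.0
```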
> >> > > > >
> >> > > > > Out of curiosity, is there a way we can do some performance
> >> > > > > testing for this, and are there any tools you can recommend? As
> >> > > > > you know, cleanup was previously triggered by respecting the
> >> > > > > dirty ratio, but now it may happen any time the max lag has
> >> > > > > passed for a message. I wonder what would happen if clients send
> >> > > > > a huge amount of tombstone records at the same time.
> >> > > > >
> >> > > > > I am looking forward to having a quick chat with you to avoid
> >> > > > > double effort on this. I am in the Confluent community Slack
> >> > > > > during work hours. My name is Xiaohe Dong. :)
> >> > > > >
> >> > > > > Rgds
> >> > > > > Xiaohe Dong
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > On 2018/08/16 01:22:22, xiongqi wu <xi...@gmail.com> wrote:
> >> > > > > > Brett,
> >> > > > > >
> >> > > > > > Thank you for your comments.
> >> > > > > > Since we already have an immediate-compaction setting (setting
> >> > > > > > the min dirty ratio to 0), I decided to use "0" as the disabled
> >> > > > > > state. I am OK to go with the -1 (disable), 0 (immediate)
> >> > > > > > options.
> >> > > > > >
> >> > > > > > For the implementation, there are a few differences between
> >> > > > > > mine and Xiaohe Dong's:
> >> > > > > > 1) I used the estimated creation time of a log segment, instead
> >> > > > > > of the largest timestamp of a log, to determine compaction
> >> > > > > > eligibility, because a log segment might stay as the active
> >> > > > > > segment for up to the "max compaction lag" (see the KIP for
> >> > > > > > details).
> >> > > > > > 2) I measure how many bytes we must clean to follow the "max
> >> > > > > > compaction lag" rule, and use that to determine the order of
> >> > > > > > compaction.
> >> > > > > > 3) I force the active segment to roll to follow the "max
> >> > > > > > compaction lag".
> >> > > > > >
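Differences (1) and (2) above can be sketched together (a hypothetical model with made-up field names, not the actual implementation): estimate each partition's earliest un-compacted timestamp conservatively, then clean overdue partitions with the most bytes to clean first.

```python
def compaction_order(partitions, now_ms, max_lag_ms, max_segment_ms):
    """Estimate each partition's earliest un-compacted timestamp as
    largestTimestamp - maxSegmentMs (a segment may sit as the active segment
    for up to segment.ms), then return overdue partitions ordered by the
    most must-clean bytes first."""
    overdue = [
        (dirty_bytes, name)
        for name, (largest_ts_ms, dirty_bytes) in partitions.items()
        if now_ms - (largest_ts_ms - max_segment_ms) > max_lag_ms
    ]
    return [name for _, name in sorted(overdue, reverse=True)]

# Hypothetical partitions: name -> (largest timestamp ms, bytes to clean)
parts = {"a-0": (100, 50), "b-0": (100, 900), "c-0": (9_000, 10)}
print(compaction_order(parts, now_ms=10_000, max_lag_ms=5_000, max_segment_ms=1_000))
# ['b-0', 'a-0'] -- c-0 is still within the lag
```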
> >> > > > > > I can share my code so we can coordinate.
> >> > > > > >
> >> > > > > > I haven't thought about a new API to force a compaction. What
> >> > > > > > is the use case for this one?
> >> > > > > >
> >> > > > > >
> >> > > > > > On Wed, Aug 15, 2018 at 5:33 PM, Brett Rann
> >> > > <brann@zendesk.com.invalid
> >> > > > >
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > We've been looking into this too.
> >> > > > > > >
> >> > > > > > > Mailing list:
> >> > > > > > > https://lists.apache.org/thread.html/ed7f6a6589f94e8c2a705553f364ef599cb6915e4c3ba9b561e610e4@%3Cdev.kafka.apache.org%3E
> >> > > > > > > jira wish: https://issues.apache.org/jira/browse/KAFKA-7137
> >> > > > > > > confluent slack discussion:
> >> > > > > > > https://confluentcommunity.slack.com/archives/C49R61XMM/p1530760121000039
> >> > > > > > >
> >> > > > > > > A person on my team has started on code so you might want to
> >> > > > > > > coordinate:
> >> > > > > > > https://github.com/dongxiaohe/kafka/tree/dongxiaohe/log-cleaner-compaction-max-lifetime-2.0
> >> > > > > > >
> >> > > > > > > He's been working with Jason Gustafson and James Chen around
> >> the
> >> > > > > changes.
> >> > > > > > > You can ping him on confluent slack as Xiaohe Dong.
> >> > > > > > >
> >> > > > > > > It's great to know others are thinking on it as well.
> >> > > > > > >
> >> > > > > > > You've added the requirement to force a segment roll which
> we
> >> > > hadn't
> >> > > > > gotten
> >> > > > > > > to yet, which is great. I was content with it not including
> >> the
> >> > > > active
> >> > > > > > > segment.
> >> > > > > > >
> >> > > > > > > > Adding topic level configuration "max.compaction.lag.ms",
> >> > > > > > > > and corresponding broker configuration
> >> > > > > > > > "log.cleaner.max.compaction.lag.ms", which is set to 0
> >> > > > > > > > (disabled) by default.
> >> > > > > > >
> >> > > > > > > Glancing at some other settings, the convention seems to be
> >> > > > > > > -1 for disabled (or infinite, which is more meaningful here).
> >> > > > > > > 0 to me implies instant, a little quicker than 1.
> >> > > > > > >
> >> > > > > > > We've been trying to think about a way to trigger compaction
> >> > > > > > > through an API call as well, which would need to be flagged
> >> > > > > > > somewhere (ZK admin/ space?), but we're struggling to think
> >> > > > > > > how that would be coordinated across brokers and partitions.
> >> > > > > > > Have you given any thought to that?
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > On Thu, Aug 16, 2018 at 8:44 AM xiongqi wu <
> >> xiongqiwu@gmail.com>
> >> > > > > wrote:
> >> > > > > > >
> >> > > > > > > > Eno, Dong,
> >> > > > > > > >
> >> > > > > > > > I have updated the KIP. We decide not to address the issue
> >> that
> >> > > we
> >> > > > > might
> >> > > > > > > > have for both compaction and time retention enabled topics
> >> (see
> >> > > the
> >> > > > > > > > rejected alternative item 2). This KIP will only ensure
> log
> >> can
> >> > > be
> >> > > > > > > > compacted after a specified time-interval.
> >> > > > > > > >
> >> > > > > > > > As suggested by Dong, we will also enforce "
> >> > > max.compaction.lag.ms"
> >> > > > > is
> >> > > > > > > not
> >> > > > > > > > less than "min.compaction.lag.ms".
> >> > > > > > > >
> >> > > > > > > > KIP-354: Time-based log compaction policy
> >> > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-354
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > On Tue, Aug 14, 2018 at 5:01 PM, xiongqi wu <
> >> > xiongqiwu@gmail.com
> >> > > >
> >> > > > > wrote:
> >> > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > Per discussion with Dong, he made a very good point: if
> >> > > > > > > > > compaction and time-based retention are both enabled on a
> >> > > > > > > > > topic, the compaction might prevent records from being
> >> > > > > > > > > deleted on time. The reason is that when compacting
> >> > > > > > > > > multiple segments into one single segment, the newly
> >> > > > > > > > > created segment will have the same last-modified timestamp
> >> > > > > > > > > as the latest original segment. We lose the timestamps of
> >> > > > > > > > > all original segments except the last one. As a result,
> >> > > > > > > > > records might not be deleted when they should be through
> >> > > > > > > > > time-based retention.
> >> > > > > > > > >
> >> > > > > > > > > With the current KIP proposal, if we want to ensure
> timely
> >> > > > > deletion, we
> >> > > > > > > > > have the following configurations:
> >> > > > > > > > > 1) enable time based log compaction only : deletion is
> >> done
> >> > > > though
> >> > > > > > > > > overriding the same key
> >> > > > > > > > > 2) enable time based log retention only: deletion is
> done
> >> > > though
> >> > > > > > > > > time-based retention
> >> > > > > > > > > 3) enable both log compaction and time based retention:
> >> > > Deletion
> >> > > > > is not
> >> > > > > > > > > guaranteed.
> >> > > > > > > > >
> >> > > > > > > > > Not sure if we have use case 3 and also want deletion to
> >> > happen
> >> > > > on
> >> > > > > > > time.
> >> > > > > > > > > There are several options to address deletion issue when
> >> > enable
> >> > > > > both
> >> > > > > > > > > compaction and retention:
> >> > > > > > > > > A) During log compaction, look at record timestamps to
> >> > > > > > > > > delete expired records. This can be done in the
> >> > > > > > > > > compaction logic itself or via
> >> > > > > > > > > AdminClient.deleteRecords(), but this assumes we have
> >> > > > > > > > > record timestamps.
> >> > > > > > > > > B) Retain the lastModified time of the original segments
> >> > > > > > > > > during log compaction. This requires extra metadata to
> >> > > > > > > > > record the information, or not grouping multiple
> >> > > > > > > > > segments into one during compaction.
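Option (A) can be sketched with a toy compaction pass that also drops records older than the retention window (illustrative only; in Kafka the timestamps would come from the records themselves):

```python
def compact_with_retention(records, now_ms, retention_ms):
    """Toy model of option A: during compaction, additionally drop records
    whose own timestamp has passed the retention window, on top of keeping
    only the latest value per key and dropping tombstones."""
    latest = {}
    for key, value, ts_ms in records:  # offset order
        latest[key] = (value, ts_ms)
    return {
        k: v for k, (v, ts) in latest.items()
        if v is not None and now_ms - ts <= retention_ms
    }

log = [("k1", "old", 0), ("k2", "fresh", 9_000), ("k3", None, 9_500)]
print(compact_with_retention(log, now_ms=10_000, retention_ms=5_000))
# {'k2': 'fresh'} -- k1 expired by retention, k3 was a tombstone
```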
> >> > > > > > > > >
> >> > > > > > > > > If we have use case 3 in general, I would prefer
> >> > > > > > > > > solution A and rely on record timestamps.
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > Two questions:
> >> > > > > > > > > Do we have use case 3? Is it nice to have or must have?
> >> > > > > > > > > If we have use case 3 and want to go with solution A,
> >> should
> >> > we
> >> > > > > > > introduce
> >> > > > > > > > > a new configuration to enforce deletion by timestamp?
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > On Tue, Aug 14, 2018 at 1:52 PM, xiongqi wu <
> >> > > xiongqiwu@gmail.com
> >> > > > >
> >> > > > > > > wrote:
> >> > > > > > > > >
> >> > > > > > > > >> Dong,
> >> > > > > > > > >>
> >> > > > > > > > >> Thanks for the comment.
> >> > > > > > > > >>
> >> > > > > > > > >> There are two retention policy: log compaction and time
> >> > based
> >> > > > > > > retention.
> >> > > > > > > > >>
> >> > > > > > > > >> Log compaction:
> >> > > > > > > > >>
> >> > > > > > > > >> We have use cases that keep infinite retention of a
> >> > > > > > > > >> topic (compaction only). GDPR cares about deletion of
> >> > > > > > > > >> PII (personally identifiable information) data. Since
> >> > > > > > > > >> Kafka doesn't know which records contain PII, it relies
> >> > > > > > > > >> on the upper layer to delete those records. For those
> >> > > > > > > > >> infinite retention use cases, Kafka needs to provide a
> >> > > > > > > > >> way to enforce compaction on time. This is what we try
> >> > > > > > > > >> to address in this KIP.
> >> > > > > > > > >>
> >> > > > > > > > >> Time based retention,
> >> > > > > > > > >>
> >> > > > > > > > >> There are also use cases that users of Kafka might want
> >> to
> >> > > > expire
> >> > > > > all
> >> > > > > > > > >> their data.
> >> > > > > > > > >> In those cases, they can use time based retention of
> >> their
> >> > > > topics.
> >> > > > > > > > >>
> >> > > > > > > > >>
> >> > > > > > > > >> Regarding your first question, if a user wants to
> delete
> >> a
> >> > key
> >> > > > in
> >> > > > > the
> >> > > > > > > > >> log compaction topic, the user has to send a deletion
> >> using
> >> > > the
> >> > > > > same
> >> > > > > > > > key.
> >> > > > > > > > >> Kafka only makes sure the deletion will happen under a
> >> > certain
> >> > > > > time
> >> > > > > > > > >> periods (like 2 days/7 days).
> >> > > > > > > > >>
> >> > > > > > > > >> Regarding your second question. In most cases, we might
> >> want
> >> > > to
> >> > > > > delete
> >> > > > > > > > >> all duplicated keys at the same time.
> >> > > > > > > > >> Compaction might be more efficient, since we need to
> >> > > > > > > > >> scan the log and find all duplicates. However, the
> >> > > > > > > > >> expected use case is to set the time-based compaction
> >> > > > > > > > >> interval on the order of days, larger than the "min
> >> > > > > > > > >> compaction lag". We don't want log compaction to happen
> >> > > > > > > > >> frequently, since it is expensive. The purpose is to
> >> > > > > > > > >> help low-production-rate topics get compacted on time.
> >> > > > > > > > >> For topics with a "normal" incoming message rate, the
> >> > > > > > > > >> "min dirty ratio" might have triggered the compaction
> >> > > > > > > > >> before this time-based compaction policy takes effect.
> >> > > > > > > > >>
> >> > > > > > > > >>
> >> > > > > > > > >> Eno,
> >> > > > > > > > >>
> >> > > > > > > > >> For your question, like I mentioned we have long time
> >> > > retention
> >> > > > > use
> >> > > > > > > case
> >> > > > > > > > >> for log compacted topic, but we want to provide ability
> >> to
> >> > > > delete
> >> > > > > > > > certain
> >> > > > > > > > >> PII records on time.
> >> > > > > > > > >> Kafka itself doesn't know whether a record contains
> >> > sensitive
> >> > > > > > > > information
> >> > > > > > > > >> and relies on the user for deletion.
> >> > > > > > > > >>
> >> > > > > > > > >>
> >> > > > > > > > >> On Mon, Aug 13, 2018 at 6:58 PM, Dong Lin <
> >> > > lindong28@gmail.com>
> >> > > > > > > wrote:
> >> > > > > > > > >>
> >> > > > > > > > >>> Hey Xiongqi,
> >> > > > > > > > >>>
> >> > > > > > > > >>> Thanks for the KIP. I have two questions regarding the
> >> > > use-case
> >> > > > > for
> >> > > > > > > > >>> meeting
> >> > > > > > > > >>> GDPR requirement.
> >> > > > > > > > >>>
> >> > > > > > > > >>> 1) If I recall correctly, one of the GDPR requirement
> is
> >> > that
> >> > > > we
> >> > > > > can
> >> > > > > > > > not
> >> > > > > > > > >>> keep messages longer than e.g. 30 days in storage
> (e.g.
> >> > > Kafka).
> >> > > > > Say
> >> > > > > > > > there
> >> > > > > > > > >>> exists a partition p0 which contains message1 with
> key1
> >> and
> >> > > > > message2
> >> > > > > > > > with
> >> > > > > > > > >>> key2. And then user keeps producing messages with
> >> key=key2
> >> > to
> >> > > > > this
> >> > > > > > > > >>> partition. Since message1 with key1 is never
> overridden,
> >> > > sooner
> >> > > > > or
> >> > > > > > > > later
> >> > > > > > > > >>> we
> >> > > > > > > > >>> will want to delete message1 and keep the latest
> message
> >> > with
> >> > > > > > > key=key2.
> >> > > > > > > > >>> But
> >> > > > > > > > >>> currently it looks like log compact logic in Kafka
> will
> >> > > always
> >> > > > > put
> >> > > > > > > > these
> >> > > > > > > > >>> messages in the same segment. Will this be an issue?
> >> > > > > > > > >>>
> >> > > > > > > > >>> 2) The current KIP intends to provide the capability
> to
> >> > > delete
> >> > > > a
> >> > > > > > > given
> >> > > > > > > > >>> message in log compacted topic. Does such use-case
> also
> >> > > require
> >> > > > > Kafka
> >> > > > > > > > to
> >> > > > > > > > >>> keep the messages produced before the given message?
> If
> >> > yes,
> >> > > > > then we
> >> > > > > > > > can
> >> > > > > > > > >>> probably just use AdminClient.deleteRecords() or
> >> time-based
> >> > > log
> >> > > > > > > > retention
> >> > > > > > > > >>> to meet the use-case requirement. If no, do you know
> >> what
> >> > is
> >> > > > the
> >> > > > > > > GDPR's
> >> > > > > > > > >>> requirement on time-to-deletion after user explicitly
> >> > > requests
> >> > > > > the
> >> > > > > > > > >>> deletion
> >> > > > > > > > >>> (e.g. 1 hour, 1 day, 7 day)?
> >> > > > > > > > >>>
> >> > > > > > > > >>> Thanks,
> >> > > > > > > > >>> Dong
> >> > > > > > > > >>>
> >> > > > > > > > >>>
> >> > > > > > > > >>> On Mon, Aug 13, 2018 at 3:44 PM, xiongqi wu <
> >> > > > xiongqiwu@gmail.com
> >> > > > > >
> >> > > > > > > > wrote:
> >> > > > > > > > >>>
> >> > > > > > > > >>> > Hi Eno,
> >> > > > > > > > >>> >
> >> > > > > > > > >>> > The GDPR request we are getting here at LinkedIn is:
> >> > > > > > > > >>> > if we get a request to delete a record through a
> >> > > > > > > > >>> > tombstone (a null value for a key) on a log compacted
> >> > > > > > > > >>> > topic, we want to delete the record via compaction
> >> > > > > > > > >>> > within a given time period, like 2 days (whatever is
> >> > > > > > > > >>> > required by the policy).
> >> > > > > > > > >>> >
> >> > > > > > > > >>> > There might be other issues (such as orphan log
> >> segments
> >> > > > under
> >> > > > > > > > certain
> >> > > > > > > > >>> > conditions) that lead to GDPR problem but they are
> >> more
> >> > > like
> >> > > > > > > > >>> something we
> >> > > > > > > > >>> > need to fix anyway regardless of GDPR.
> >> > > > > > > > >>> >
> >> > > > > > > > >>> >
> >> > > > > > > > >>> > -- Xiongqi (Wesley) Wu
> >> > > > > > > > >>> >
> >> > > > > > > > >>> > On Mon, Aug 13, 2018 at 2:56 PM, Eno Thereska <
> >> > > > > > > > eno.thereska@gmail.com>
> >> > > > > > > > >>> > wrote:
> >> > > > > > > > >>> >
> >> > > > > > > > >>> > > Hello,
> >> > > > > > > > >>> > >
> >> > > > > > > > >>> > > Thanks for the KIP. I'd like to see a more precise
> >> > > > > definition of
> >> > > > > > > > what
> >> > > > > > > > >>> > part
> >> > > > > > > > >>> > > of GDPR you are targeting as well as some sort of
> >> > > > > verification
> >> > > > > > > that
> >> > > > > > > > >>> this
> >> > > > > > > > >>> > > KIP actually addresses the problem. Right now I
> find
> >> > > this a
> >> > > > > bit
> >> > > > > > > > >>> vague:
> >> > > > > > > > >>> > >
> >> > > > > > > > >>> > > "Ability to delete a log message through
> compaction
> >> in
> >> > a
> >> > > > > timely
> >> > > > > > > > >>> manner
> >> > > > > > > > >>> > has
> >> > > > > > > > >>> > > become an important requirement in some use cases
> >> > (e.g.,
> >> > > > > GDPR)"
> >> > > > > > > > >>> > >
> >> > > > > > > > >>> > >
> >> > > > > > > > >>> > > Is there any guarantee that after this KIP the
> GDPR
> >> > > problem
> >> > > > > is
> >> > > > > > > > >>> solved or
> >> > > > > > > > >>> > do
> >> > > > > > > > >>> > > we need to do something else as well, e.g., more
> >> KIPs?
> >> > > > > > > > >>> > >
> >> > > > > > > > >>> > >
> >> > > > > > > > >>> > > Thanks
> >> > > > > > > > >>> > >
> >> > > > > > > > >>> > > Eno
> >> > > > > > > > >>> > >
> >> > > > > > > > >>> > >
> >> > > > > > > > >>> > >
> >> > > > > > > > >>> > > On Thu, Aug 9, 2018 at 4:18 PM, xiongqi wu <
> >> > > > > xiongqiwu@gmail.com>
> >> > > > > > > > >>> wrote:
> >> > > > > > > > >>> > >
> >> > > > > > > > >>> > > > Hi Kafka,
> >> > > > > > > > >>> > > >
> >> > > > > > > > >>> > > > This KIP tries to address GDPR concern to
> fulfill
> >> > > > deletion
> >> > > > > > > > request
> >> > > > > > > > >>> on
> >> > > > > > > > >>> > > time
> >> > > > > > > > >>> > > > through time-based log compaction on a
> compaction
> >> > > enabled
> >> > > > > > > topic:
> >> > > > > > > > >>> > > >
> >> > > > > > > > >>> > > >
> >> > > > > > > > >>> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-354%3A+Time-based+log+compaction+policy
> >> > > > > > > > >>> > > >
> >> > > > > > > > >>> > > > Any feedback will be appreciated.
> >> > > > > > > > >>> > > >
> >> > > > > > > > >>> > > >
> >> > > > > > > > >>> > > > Xiongqi (Wesley) Wu
> >> > > > > > > > >>> > > >
> >> > > > > > > > >>> > >
> >> > > > > > > > >>> >
> >> > > > > > > > >>>
> >> > > > > > > > >>
> >> > > > > > > > >>
> >> > > > > > > > >>
> >> > > > > > > > >> --
> >> > > > > > > > >> Xiongqi (Wesley) Wu
> >> > > > > > > > >>
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > --
> >> > > > > > > > > Xiongqi (Wesley) Wu
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > --
> >> > > > > > > > Xiongqi (Wesley) Wu
> >> > > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > --
> >> > > > > > Xiongqi (Wesley) Wu
> >> > > > > >
> >> > > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > > --
> >> > > > Xiongqi (Wesley) Wu
> >> > > >
> >> > >
> >> >
> >> >
> >> >
> >> > --
> >> > Xiongqi (Wesley) Wu
> >> >
> >>
> >>
> >>
> >
>
>
> --
> Xiongqi (Wesley) Wu
>


-- 

Brett Rann

Senior DevOps Engineer


Zendesk International Ltd

395 Collins Street, Melbourne VIC 3000 Australia

Mobile: +61 (0) 418 826 017

Re: [DISCUSS] KIP-354 Time-based log compaction policy

Posted by xiongqi wu <xi...@gmail.com>.
Dong,

Thanks for the comments. I have updated the KIP based on your comments.

Below are replies to your questions:

1. We only calculate this metric for log compactions that are triggered by
the max compaction lag, so we only collect non-negative values. The log
cleaner runs continuously, with some backoff time if there is no work to
do. The max is the max among all log cleaner threads in their latest run,
not the historical max. This is similar to the existing metric
"max-clean-time-secs". I now mention in the KIP that this metric comes from
each thread.
Users can look at the historical data to track how the delay changes over
time (as with other log cleaner metrics).

Another way of defining this metric is "compaction_finish_time -
earliest_timestamp_of_first_uncompacted_segment", so that it is not defined
w.r.t. the max compaction lag. However, the max compaction lag may vary for
different topics, and that definition doesn't really tell how soon a
compaction request is fulfilled after the max compaction lag. What do you
think?
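The per-thread metric described in point 1 can be sketched as (hypothetical helper, not the KIP's code):

```python
def max_compaction_delay_secs(thread_delays_secs):
    """Each cleaner thread reports, for its latest run, how far past
    max.compaction.lag.ms its partition was (None if the run was not
    triggered by the max lag); the metric is the max across threads,
    and 0 when no thread's latest run was triggered by the max lag."""
    triggered = [d for d in thread_delays_secs if d is not None]
    return max(triggered) if triggered else 0

# Thread 1 compacted a partition 120 s past its max lag; thread 2's last
# run was not triggered by max.compaction.lag.ms at all.
print(max_compaction_delay_secs([120, None]))   # 120
print(max_compaction_delay_secs([None, None]))  # 0
```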

2. This is intended to track whether the latest logs compacted were
triggered by the max compaction lag. The metric is updated on each log
cleaner run. If there are two log cleaner threads and both worked on log
partitions triggered by "max compaction lag" in their last run, the value
of this metric will be 2. The previous metric doesn't provide this
information if there is more than one log cleaner thread.

3. I meant to say a partition is required to be picked up by log compaction
after this max lag. But the actual compaction finish time may vary, since
the log cleaner may take time to finish compaction on the partition, or may
work on other partitions first. "Guarantee" may be misleading; I have
updated the KIP.

4. It is determined based on the cleaner checkpoint file. This KIP doesn't
change how the broker determines the un-compacted segments.
5. Done.
6. Why should this feature depend on message timestamps?
"segment.largestTimestamp - maxSegmentMs" is a reasonable estimate for
determining a violation of the max compaction lag, and this estimate is
only needed if the first segment of a log partition is un-compacted.
7. I removed the unrelated part, and specifically mentioned that the added
metric "num-logs-compacted-by-max-compaction-lag" can be used to measure
the performance impact.

Xiongqi (Wesley) Wu


On Tue, Nov 6, 2018 at 6:50 PM Dong Lin <li...@gmail.com> wrote:

> Hey Xiongqi,
>
> Thanks for the update. A few more comments below
>
> 1) According to the definition of
> kafka.log:type=LogCleaner,name=max-compaction-delay, it seems that the
> metric value will be a large negative number if max.compaction.lag.ms is
> MAX_LONG. Would this be a problem? Also, it seems weird that the value of
> the metric is defined w.r.t. how often the log cleaner is run.
>
> 2) Not sure if we need the metric num-logs-compacted-by-max-compaction-lag
> in addition to max-compaction-delay. It seems that operator can just use
> max-compaction-delay to determine whether the max.compaction.lag is
> properly enforced in a quantitative manner. Also, the metric name
> `num-logs-compacted-by-max-compaction-lag` is inconsistent with its
> intended meaning, i.e. the number of logs that need to be compacted due to
> max.compaction.lag but not yet compacted. So it is probably simpler to just
> remove this metric.
>
> 3) The KIP currently says that "a message record has a guaranteed
> upper-bound in time to become mandatory for compaction". The word
> "guarantee" may be misleading because the message may still not be
> compacted within max.compaction.lag after its creation. Could you clarify
> the exact semantics of the max.compaction.lag.ms in the Public Interface
> section?
>
> 4) The KIP's proposed change will estimate the earliest message timestamp for
> un-compacted log segments. Can you explain how the broker determines whether a
> segment has been compacted after the broker is restarted?
>
> 5) 2.b in the Proposed Change section provides two ways to get the timestamp. To
> make the KIP easier to read for future reference, could we just mention the
> method that we plan to use and move the other solution to the rejected
> alternative section?
>
> 6) Based on the discussion (i.e. point 2 in the previous email), it is said
> that we can assume all messages have timestamp and the feature added in
> this KIP can be skipped for those messages which do not have timestamp. So
> do we still need to use "segment.largestTimestamp - maxSegmentMs" in
> Proposed Change section 2.a?
>
> 7) Based on the discussion (i.e. point 8 in the previous email), if this
> KIP requires user to monitor certain existing metrics for performance
> impact added in this KIP, can we list the metrics in the KIP for user's
> convenience?
>
>
> Thanks,
> Dong
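The semantics discussed in points 1–3 above amount to a simple eligibility check, and the negative-metric concern from point 1 can be avoided by clamping the delay at zero. A hedged sketch only — the method and parameter names are illustrative, not the final implementation:

```java
// Sketch of the eligibility rule implied by max.compaction.lag.ms: a log must
// be picked for compaction once its earliest un-compacted record is older than
// the configured lag. Names are illustrative, not Kafka's actual code.
public class CompactionLagCheck {
    static boolean mustCompact(long earliestDirtyTimestampMs,
                               long maxCompactionLagMs,
                               long nowMs) {
        if (maxCompactionLagMs == Long.MAX_VALUE)
            return false; // MAX_LONG default effectively disables the rule
        return nowMs - earliestDirtyTimestampMs > maxCompactionLagMs;
    }

    // One way to define the proposed max-compaction-delay metric: how far past
    // the deadline a log is, clamped at 0 so a MAX_LONG lag never reports a
    // large negative value.
    static long compactionDelayMs(long earliestDirtyTimestampMs,
                                  long maxCompactionLagMs,
                                  long nowMs) {
        return Math.max(0L, nowMs - earliestDirtyTimestampMs - maxCompactionLagMs);
    }

    public static void main(String[] args) {
        long now = 1_000_000L;
        System.out.println(mustCompact(0L, 500_000L, now));        // true
        System.out.println(compactionDelayMs(0L, 500_000L, now));  // 500000
        System.out.println(mustCompact(900_000L, 500_000L, now));  // false
    }
}
```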
>
> On Mon, Oct 29, 2018 at 3:16 PM xiongqi wu <xi...@gmail.com> wrote:
>
> > Hi Dong,
> > I have updated the KIP to address your comments.
> > One correction to previous Email:
> > after offline discussion with Dong,  we decide to use MAX_LONG as default
> > value for max.compaction.lag.ms.
> >
> >
> > Xiongqi (Wesley) Wu
> >
> >
> > On Mon, Oct 29, 2018 at 12:15 PM xiongqi wu <xi...@gmail.com> wrote:
> >
> > > Hi Dong,
> > >
> > > Thank you for your comment.  See my inline comments.
> > > I will update the KIP shortly.
> > >
> > > Xiongqi (Wesley) Wu
> > >
> > >
> > > On Sun, Oct 28, 2018 at 9:17 PM Dong Lin <li...@gmail.com> wrote:
> > >
> > >> Hey Xiongqi,
> > >>
> > >> Sorry for late reply. I have some comments below:
> > >>
> > >> 1) As discussed earlier in the email list, if the topic is configured
> > with
> > >> both deletion and compaction, in some cases messages produced a long
> > time
> > >> ago can not be deleted based on time. This is a valid use-case because
> > we
> > >> actually have topics which are configured with both deletion and
> > compaction
> > >> policy. And we should enforce the semantics for both policies. Solution
> A
> > >> sounds good. We do not need interface change (e.g. extra config) to
> > >> enforce
> > >> solution A. All we need is to update implementation so that when
> broker
> > >> compacts a topic, if the message has timestamp (which is the common
> > case),
> > >> messages that are too old (based on the time-based retention config)
> > will
> > >> be discarded. Since this is a valid issue and it is also related to
> the
> > >> guarantee of when a message can be deleted, can we include the
> solution
> > of
> > >> this problem in the KIP?
> > >>
> > > ======  This makes sense.  We can use a similar approach to increase the
> > log
> > > start offset.
> > >
> > >>
> > >> 2) It is probably OK to assume that all messages have timestamp. The
> > >> per-message timestamp was introduced into Kafka 0.10.0 with KIP-31 and
> > >> KIP-32 as of Feb 2016. Kafka 0.10.0 or earlier versions are no longer
> > >> supported. Also, since the use-case for this feature is primarily for
> > >> GDPR,
> > >> we can assume that client library has already been upgraded to support
> > >> SSL,
> > >> which feature is added after KIP-31 and KIP-32.
> > >>
> > >>  =========>  Ok. We can use message timestamp to delete expired
> records
> > > if both compaction and retention are enabled.
> > >
> > >
> > > 3) In Proposed Change section 2.a, it is said that
> > segment.largestTimestamp
> > >> - maxSegmentMs can be used to determine the timestamp of the earliest
> > >> message. Would it be simpler to just use the create time of the file
> to
> > >> determine the time?
> > >>
> > >> ========>  Linux/Java doesn't provide an API for file creation time
> because
> > > some filesystem types don't record file creation time.
> > >
> > >
> > >> 4) The KIP suggests to use must-clean-ratio to select the partition to
> > be
> > >> compacted. Unlike dirty ratio which is mostly for performance, the
> logs
> > >> whose "must-clean-ratio" is non-zero must be compacted immediately for
> > >> correctness reasons (and for GDPR). And if this can not be achieved
> > because
> > >> e.g. broker compaction throughput is too low, investigation will be
> > >> needed.
> > >> So it seems simpler to first compact logs which has segment whose
> > earliest
> > >> timetamp is earlier than now - max.compaction.lag.ms, instead of
> > defining
> > >> must-clean-ratio and sorting logs based on this value.
> > >>
> > >>
> > > ======>  Good suggestion. This can simplify the implementation quite a
> bit
> > > if we are not too concerned about compaction of GDPR required partition
> > > queued behind some large partition.  The actual compaction completion
> > time
> > > is not guaranteed anyway.
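The simplified selection suggested in point 4 — overdue logs first, ordered by earliest un-compacted timestamp rather than by a "must-clean-ratio" — might look roughly like this (illustrative names only, not the actual LogCleaner code):

```java
import java.util.Comparator;
import java.util.List;

// Sketch of point-4 selection: pick every log whose earliest un-compacted
// timestamp has exceeded max.compaction.lag.ms, oldest first. Names are
// assumptions for illustration, not Kafka's internal types.
public class LogSelection {
    record LogState(String name, long earliestDirtyTimestampMs) {}

    static List<String> logsRequiringCompaction(List<LogState> logs,
                                                long maxCompactionLagMs,
                                                long nowMs) {
        return logs.stream()
                // Overdue: the earliest dirty record is older than the allowed lag.
                .filter(l -> nowMs - l.earliestDirtyTimestampMs() > maxCompactionLagMs)
                // Oldest violation first, so the most overdue log is compacted first.
                .sorted(Comparator.comparingLong(LogState::earliestDirtyTimestampMs))
                .map(LogState::name)
                .toList();
    }

    public static void main(String[] args) {
        List<LogState> logs = List.of(
                new LogState("topicA-0", 100L),
                new LogState("topicB-0", 900L),
                new LogState("topicC-0", 50L));
        // lag = 500 ms, now = 1000 ms: A and C are overdue, C is oldest.
        System.out.println(logsRequiringCompaction(logs, 500L, 1_000L)); // [topicC-0, topicA-0]
    }
}
```

As noted above, this ordering gives no hard bound on when compaction finishes, only on when a log becomes mandatory to pick up.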
> > >
> > >
> > >> 5) The KIP says max.compaction.lag.ms is 0 by default and it is also
> > >> suggested that 0 means disable. Should we set this value to MAX_LONG
> by
> > >> default to effectively disable the feature added in this KIP?
> > >>
> > >> ====> I would rather use 0 so the corresponding code path will not be
> > > exercised.  By using MAX_LONG, we would theoretically go through
> related
> > > code to find out whether the partition is required to be compacted to
> > > satisfy MAX_LONG.
> > >
> > > 6) It is probably cleaner and readable not to include in Public
> Interface
> > >> section those configs whose meaning is not changed.
> > >>
> > >> ====> I will clean that up.
> > >
> > > 7) The goal of this KIP is to ensure that log segment whose earliest
> > >> message is earlier than a given threshold will be compacted. This goal
> > may
> > >> not be achieved if the compaction throughput cannot catch up with the
> total
> > >> bytes-in-rate for the compacted topics on the broker. Thus we need an
> > easy
> > >> way to tell operator whether this goal is achieved. If we don't
> already
> > >> have such metric, maybe we can include metrics to show 1) the total
> > number
> > >> of log segments (or logs) which needs to be immediately compacted as
> > >> determined by max.compaction.lag; and 2) the maximum value of now -
> > >> earliest_time_stamp_of_segment among all segments that needs to be
> > >> compacted.
> > >>
> > >> =======> good suggestion.  I will update KIP for these metrics.
> > >
> > > 8) The Performance Impact suggests user to use the existing metrics to
> > >> monitor the performance impact of this KIP. It is useful to list the meaning
> of
> > >> each jmx metric that we want the user to monitor, and possibly explain
> how
> > to
> > >> interpret the value of these metrics to determine whether there is
> > >> a performance issue.
> > >>
> > >> =========>  I will update the KIP.
> > >
> > >> Thanks,
> > >> Dong
> > >>
> > >> On Tue, Oct 16, 2018 at 10:53 AM xiongqi wu <xi...@gmail.com>
> > wrote:
> > >>
> > >> > Mayuresh,
> > >> >
> > >> > Thanks for the comments.
> > >> > The requirement is that we need to pick up segments that are older
> > than
> > >> > maxCompactionLagMs for compaction.
> > >> > maxCompactionLagMs is an upper-bound, which implies that picking up
> > >> > segments for compaction earlier doesn't violate the policy.
> > >> > We use the creation time of a segment as an estimation of its
> records
> > >> > arrival time, so these records can be compacted no later than
> > >> > maxCompactionLagMs.
> > >> >
> > >> > On the other hand, compaction is an expensive operation, we don't
> want
> > >> to
> > >> > compact the log partition whenever a new segment is sealed.
> > >> > Therefore, we want to pick up a segment for compaction when the
> > segment
> > >> is
> > >> > close to the mandatory max compaction lag (so we use segment creation
> > time
> > >> as
> > >> > an estimation.)
> > >> >
> > >> >
> > >> > Xiongqi (Wesley) Wu
> > >> >
> > >> >
> > >> > On Mon, Oct 15, 2018 at 5:54 PM Mayuresh Gharat <
> > >> > gharatmayuresh15@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > Hi Wesley,
> > >> > >
> > >> > > Thanks for the KIP and sorry for being late to the party.
> > >> > >  I wanted to understand, the scenario you mentioned in Proposed
> > >> changes :
> > >> > >
> > >> > > -
> > >> > > >
> > >> > > > Estimate the earliest message timestamp of an un-compacted log
> > >> segment.
> > >> > > we
> > >> > > > only need to estimate earliest message timestamp for
> un-compacted
> > >> log
> > >> > > > segments to ensure timely compaction because the deletion
> requests
> > >> that
> > >> > > > belong to compacted segments have already been processed.
> > >> > > >
> > >> > > >    1.
> > >> > > >
> > >> > > >    for the first (earliest) log segment:  The estimated earliest
> > >> > > >    timestamp is set to the timestamp of the first message if
> > >> timestamp
> > >> > is
> > >> > > >    present in the message. Otherwise, the estimated earliest
> > >> timestamp
> > >> > > is set
> > >> > > >    to "segment.largestTimestamp - maxSegmentMs”
> > >> > > >     (segment.largestTimestamp is lastModified time of the log
> > >> segment
> > >> > or
> > >> > > max
> > >> > > >    timestamp we see for the log segment.). In the later case,
> the
> > >> > actual
> > >> > > >    timestamp of the first message might be later than the
> > >> estimation,
> > >> > > but it
> > >> > > >    is safe to pick up the log for compaction earlier.
> > >> > > >
> > >> > > > When we say "actual timestamp of the first message might be
> later
> > >> than
> > >> > > the
> > >> > > estimation, but it is safe to pick up the log for compaction
> > >> earlier.",
> > >> > > doesn't that violate the assumption that we will consider a
> segment
> > >> for
> > >> > > compaction only if the time of creation the segment has crossed
> the
> > >> "now
> > >> > -
> > >> > > maxCompactionLagMs" ?
> > >> > >
> > >> > > Thanks,
> > >> > >
> > >> > > Mayuresh
> > >> > >
> > >> > > On Mon, Sep 3, 2018 at 7:28 PM Brett Rann
> <brann@zendesk.com.invalid
> > >
> > >> > > wrote:
> > >> > >
> > >> > > > Might also be worth moving to a vote thread? Discussion seems to
> > >> have
> > >> > > gone
> > >> > > > as far as it can.
> > >> > > >
> > >> > > > > On 4 Sep 2018, at 12:08, xiongqi wu <xi...@gmail.com>
> > wrote:
> > >> > > > >
> > >> > > > > Brett,
> > >> > > > >
> > >> > > > > Yes, I will post PR tomorrow.
> > >> > > > >
> > >> > > > > Xiongqi (Wesley) Wu
> > >> > > > >
> > >> > > > >
> > >> > > > > On Sun, Sep 2, 2018 at 6:28 PM Brett Rann
> > >> <brann@zendesk.com.invalid
> > >> > >
> > >> > > > wrote:
> > >> > > > >
> > >> > > > > > +1 (non-binding) from me on the interface. I'd like to see
> > >> someone
> > >> > > > familiar
> > >> > > > > > with
> > >> > > > > > the code comment on the approach, and note there's a couple
> of
> > >> > > > different
> > >> > > > > > approaches: what's documented in the KIP, and what Xiaohe
> Dong
> > >> was
> > >> > > > working
> > >> > > > > > on
> > >> > > > > > here:
> > >> > > > > >
> > >> > > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://github.com/dongxiaohe/kafka/tree/dongxiaohe/log-cleaner-compaction-max-lifetime-2.0
> > >> > > > > >
> > >> > > > > > If you have code working already Xiongqi Wu could you share
> a
> > >> PR?
> > >> > I'd
> > >> > > > be
> > >> > > > > > happy
> > >> > > > > > to start testing.
> > >> > > > > >
> > >> > > > > > On Tue, Aug 28, 2018 at 5:57 AM xiongqi wu <
> > xiongqiwu@gmail.com
> > >> >
> > >> > > > wrote:
> > >> > > > > >
> > >> > > > > > > Hi All,
> > >> > > > > > >
> > >> > > > > > > Do you have any additional comments on this KIP?
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > On Thu, Aug 16, 2018 at 9:17 PM, xiongqi wu <
> > >> xiongqiwu@gmail.com
> > >> > >
> > >> > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > on 2)
> > >> > > > > > > > The offsetmap is built starting from dirty segment.
> > >> > > > > > > > The compaction starts from the beginning of the log
> > >> partition.
> > >> > > > That's
> > >> > > > > > how
> > >> > > > > > > > it ensures the deletion of tombstoned keys.
> > >> > > > > > > > I will double check tomorrow.
> > >> > > > > > > >
> > >> > > > > > > > Xiongqi (Wesley) Wu
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > On Thu, Aug 16, 2018 at 6:46 PM Brett Rann
> > >> > > > <br...@zendesk.com.invalid>
> > >> > > > > > > > wrote:
> > >> > > > > > > >
> > >> > > > > > > >> To just clarify a bit on 1. whether there's an external
> > >> > > storage/DB
> > >> > > > > > isn't
> > >> > > > > > > >> relevant here.
> > >> > > > > > > >> Compacted topics allow a tombstone record to be sent (a
> > >> null
> > >> > > value
> > >> > > > > > for a
> > >> > > > > > > >> key) which
> > >> > > > > > > >> currently will result in old values for that key being
> > >> deleted
> > >> > > if
> > >> > > > some
> > >> > > > > > > >> conditions are met.
> > >> > > > > > > >> There are existing controls to make sure the old values
> > >> will
> > >> > > stay
> > >> > > > > > around
> > >> > > > > > > >> for a minimum
> > >> > > > > > > >> time at least, but no dedicated control to ensure the
> > >> > tombstone
> > >> > > > will
> > >> > > > > > > >> delete
> > >> > > > > > > >> within a
> > >> > > > > > > >> maximum time.
> > >> > > > > > > >>
> > >> > > > > > > >> One popular reason that maximum time for deletion is
> > >> desirable
> > >> > > > right
> > >> > > > > > now
> > >> > > > > > > >> is
> > >> > > > > > > >> GDPR with
> > >> > > > > > > >> PII. But we're not proposing any GDPR awareness in
> kafka,
> > >> just
> > >> > > > being
> > >> > > > > > > able
> > >> > > > > > > >> to guarantee
> > >> > > > > > > >> a max time where a tombstoned key will be removed from
> > the
> > >> > > > compacted
> > >> > > > > > > >> topic.
> > >> > > > > > > >>
> > >> > > > > > > >> on 2)
> > >> > > > > > > >> huh, i thought it kept track of the first dirty segment
> > and
> > >> > > didn't
> > >> > > > > > > >> recompact older "clean" ones.
> > >> > > > > > > >> But I didn't look at code or test for that.
> > >> > > > > > > >>
> > >> > > > > > > >> On Fri, Aug 17, 2018 at 10:57 AM xiongqi wu <
> > >> > > xiongqiwu@gmail.com>
> > >> > > > > > > wrote:
> > >> > > > > > > >>
> > >> > > > > > > >> > 1, Owner of data (in this sense, kafka is the not the
> > >> owner
> > >> > of
> > >> > > > data)
> > >> > > > > > > >> > should keep track of lifecycle of the data in some
> > >> external
> > >> > > > > > > storage/DB.
> > >> > > > > > > >> > The owner determines when to delete the data and send
> > the
> > >> > > delete
> > >> > > > > > > >> request to
> > >> > > > > > > >> > kafka. Kafka doesn't know about the content of data
> but
> > >> to
> > >> > > > provide a
> > >> > > > > > > >> mean
> > >> > > > > > > >> > for deletion.
> > >> > > > > > > >> >
> > >> > > > > > > >> > 2 , each time compaction runs, it will start from
> first
> > >> > > > segments (no
> > >> > > > > > > >> > matter if it is compacted or not). The time
> estimation
> > >> here
> > >> > is
> > >> > > > only
> > >> > > > > > > used
> > >> > > > > > > >> > to determine whether we should run compaction on this
> > log
> > >> > > > partition.
> > >> > > > > > > So
> > >> > > > > > > >> we
> > >> > > > > > > >> > only need to estimate uncompacted segments.
> > >> > > > > > > >> >
> > >> > > > > > > >> > On Thu, Aug 16, 2018 at 5:35 PM, Dong Lin <
> > >> > > lindong28@gmail.com>
> > >> > > > > > > wrote:
> > >> > > > > > > >> >
> > >> > > > > > > >> > > Hey Xiongqi,
> > >> > > > > > > >> > >
> > >> > > > > > > >> > > Thanks for the update. I have two questions for the
> > >> latest
> > >> > > > KIP.
> > >> > > > > > > >> > >
> > >> > > > > > > >> > > 1) The motivation section says that one use case is
> > to
> > >> > > delete
> > >> > > > PII
> > >> > > > > > > >> > (Personal
> > >> > > > > > > >> > > Identifiable information) data within 7 days while
> > >> keeping
> > >> > > > non-PII
> > >> > > > > > > >> > > indefinitely in compacted format. I suppose the
> > >> use-case
> > >> > > > depends
> > >> > > > > > on
> > >> > > > > > > >> the
> > >> > > > > > > >> > > application to determine when to delete those PII
> > data.
> > >> > > Could
> > >> > > > you
> > >> > > > > > > >> explain
> > >> > > > > > > >> > > how can application reliably determine the set of
> > keys
> > >> > that
> > >> > > > should
> > >> > > > > > > be
> > >> > > > > > > >> > > deleted? Is the application required to always consume messages
> > >> from
> > >> > the
> > >> > > > topic
> > >> > > > > > > >> after
> > >> > > > > > > >> > > every restart and determine the keys to be deleted
> by
> > >> > > looking
> > >> > > > at
> > >> > > > > > > >> message
> > >> > > > > > > >> > > timestamp, or is application supposed to persist
> the
> > >> key->
> > >> > > > > > timstamp
> > >> > > > > > > >> > > information in a separate persistent storage
> system?
> > >> > > > > > > >> > >
> > >> > > > > > > >> > > 2) It is mentioned in the KIP that "we only need to
> > >> > estimate
> > >> > > > > > > earliest
> > >> > > > > > > >> > > message timestamp for un-compacted log segments
> > because
> > >> > the
> > >> > > > > > deletion
> > >> > > > > > > >> > > requests that belong to compacted segments have
> > already
> > >> > been
> > >> > > > > > > >> processed".
> > >> > > > > > > >> > > Not sure if it is correct. If a segment is
> compacted
> > >> > before
> > >> > > > user
> > >> > > > > > > sends
> > >> > > > > > > >> > > message to delete a key in this segment, it seems
> > that
> > >> we
> > >> > > > still
> > >> > > > > > need
> > >> > > > > > > >> to
> > >> > > > > > > >> > > ensure that the segment will be compacted again
> > within
> > >> the
> > >> > > > given
> > >> > > > > > > time
> > >> > > > > > > >> > after
> > >> > > > > > > >> > > the deletion is requested, right?
> > >> > > > > > > >> > >
> > >> > > > > > > >> > > Thanks,
> > >> > > > > > > >> > > Dong
> > >> > > > > > > >> > >
> > >> > > > > > > >> > > On Thu, Aug 16, 2018 at 10:27 AM, xiongqi wu <
> > >> > > > xiongqiwu@gmail.com
> > >> > > > > > >
> > >> > > > > > > >> > wrote:
> > >> > > > > > > >> > >
> > >> > > > > > > >> > > > Hi Xiaohe,
> > >> > > > > > > >> > > >
> > >> > > > > > > >> > > > Quick note:
> > >> > > > > > > >> > > > 1) Use minimum of segment.ms and
> > >> max.compaction.lag.ms
> > >> > > > > > > >> > > >
> > >> > > > > > > >> > > > 2) I am not sure if I get your second question.
> > >> first,
> > >> > we
> > >> > > > have
> > >> > > > > > > >> jitter
> > >> > > > > > > >> > > when
> > >> > > > > > > >> > > > we roll the active segment. second, on each
> > >> compaction,
> > >> > we
> > >> > > > > > compact
> > >> > > > > > > >> upto
> > >> > > > > > > >> > > > the offsetmap could allow. Those will not lead to
> > >> > perfect
> > >> > > > > > > compaction
> > >> > > > > > > >> > > storm
> > >> > > > > > > >> > > > overtime. In addition, I expect we are setting
> > >> > > > > > > >> max.compaction.lag.ms
> > >> > > > > > > >> > on
> > >> > > > > > > >> > > > the order of days.
> > >> > > > > > > >> > > >
> > >> > > > > > > >> > > > 3) I don't have access to the confluent community
> > >> slack
> > >> > > for
> > >> > > > > > now. I
> > >> > > > > > > >> am
> > >> > > > > > > >> > > > reachable via the google handle out.
> > >> > > > > > > >> > > > To avoid the double effort, here is my plan:
> > >> > > > > > > >> > > > a) Collect more feedback and feature requriement
> on
> > >> the
> > >> > > KIP.
> > >> > > > > > > >> > > > b) Wait unitl this KIP is approved.
> > >> > > > > > > >> > > > c) I will address any additional requirements in
> > the
> > >> > > > > > > implementation.
> > >> > > > > > > >> > (My
> > >> > > > > > > >> > > > current implementation only complies to whatever
> > >> > described
> > >> > > > in
> > >> > > > > > the
> > >> > > > > > > >> KIP
> > >> > > > > > > >> > > now)
> > >> > > > > > > >> > > > d) I can share the code with the you and
> community
> > >> see
> > >> > you
> > >> > > > want
> > >> > > > > > to
> > >> > > > > > > >> add
> > >> > > > > > > >> > > > anything.
> > >> > > > > > > >> > > > e) submission through committee
> > >> > > > > > > >> > > >
> > >> > > > > > > >> > > >
> > >> > > > > > > >> > > > On Wed, Aug 15, 2018 at 11:42 PM, XIAOHE DONG <
> > >> > > > > > > >> dannyrivclo@gmail.com>
> > >> > > > > > > >> > > > wrote:
> > >> > > > > > > >> > > >
> > >> > > > > > > >> > > > > Hi Xiongqi
> > >> > > > > > > >> > > > >
> > >> > > > > > > >> > > > > Thanks for thinking about implementing this as
> > >> well.
> > >> > :)
> > >> > > > > > > >> > > > >
> > >> > > > > > > >> > > > > I was thinking about using `segment.ms` to
> > trigger
> > >> > the
> > >> > > > > > segment
> > >> > > > > > > >> roll.
> > >> > > > > > > >> > > > > Also, its value can be the largest time bias
> for
> > >> the
> > >> > > > record
> > >> > > > > > > >> deletion.
> > >> > > > > > > >> > > For
> > >> > > > > > > >> > > > > example, if the `segment.ms` is 1 day and `
> > >> > > > max.compaction.ms`
> > >> > > > > > > is
> > >> > > > > > > >> 30
> > >> > > > > > > >> > > > days,
> > >> > > > > > > >> > > > > the compaction may happen around 31 days.
> > >> > > > > > > >> > > > >
> > >> > > > > > > >> > > > > For my curiosity, is there a way we can do some
> > >> > > > performance
> > >> > > > > > test
> > >> > > > > > > >> for
> > >> > > > > > > >> > > this
> > >> > > > > > > >> > > > > and any tools you can recommend. As you know,
> > >> > > previously,
> > >> > > > it
> > >> > > > > > is
> > >> > > > > > > >> > cleaned
> > >> > > > > > > >> > > > up
> > >> > > > > > > >> > > > > by respecting dirty ratio, but now it may
> happen
> > >> > anytime
> > >> > > > if
> > >> > > > > > max
> > >> > > > > > > >> lag
> > >> > > > > > > >> > has
> > >> > > > > > > >> > > > > passed for each message. I wonder what would
> > >> happen if
> > >> > > > clients
> > >> > > > > > > >> send
> > >> > > > > > > >> > > huge
> > >> > > > > > > >> > > > > amount of tombstone records at the same time.
> > >> > > > > > > >> > > > >
> > >> > > > > > > >> > > > > I am looking forward to have a quick chat with
> > you
> > >> to
> > >> > > > avoid
> > >> > > > > > > double
> > >> > > > > > > >> > > effort
> > >> > > > > > > >> > > > > on this. I am in confluent community slack
> during
> > >> the
> > >> > > work
> > >> > > > > > time.
> > >> > > > > > > >> My
> > >> > > > > > > >> > > name
> > >> > > > > > > >> > > > is
> > >> > > > > > > >> > > > > Xiaohe Dong. :)
> > >> > > > > > > >> > > > >
> > >> > > > > > > >> > > > > Rgds
> > >> > > > > > > >> > > > > Xiaohe Dong
> > >> > > > > > > >> > > > >
> > >> > > > > > > >> > > > >
> > >> > > > > > > >> > > > >
> > >> > > > > > > >> > > > > On 2018/08/16 01:22:22, xiongqi wu <
> > >> > xiongqiwu@gmail.com
> > >> > > >
> > >> > > > > > wrote:
> > >> > > > > > > >> > > > > > Brett,
> > >> > > > > > > >> > > > > >
> > >> > > > > > > >> > > > > > Thank you for your comments.
> > >> > > > > > > >> > > > > > I was thinking since we already has immediate
> > >> > > compaction
> > >> > > > > > > >> setting by
> > >> > > > > > > >> > > > > setting
> > >> > > > > > > >> > > > > > min dirty ratio to 0, so I decide to use "0"
> as
> > >> > > disabled
> > >> > > > > > > state.
> > >> > > > > > > >> > > > > > I am ok to go with -1(disable), 0 (immediate)
> > >> > options.
> > >> > > > > > > >> > > > > >
> > >> > > > > > > >> > > > > > For the implementation, there are a few
> > >> differences
> > >> > > > between
> > >> > > > > > > mine
> > >> > > > > > > >> > and
> > >> > > > > > > >> > > > > > "Xiaohe Dong"'s :
> > >> > > > > > > >> > > > > > 1) I used the estimated creation time of a
> log
> > >> > segment
> > >> > > > > > instead
> > >> > > > > > > >> of
> > >> > > > > > > >> > > > largest
> > >> > > > > > > >> > > > > > timestamp of a log to determine the
> compaction
> > >> > > > eligibility,
> > >> > > > > > > >> > because a
> > >> > > > > > > >> > > > log
> > >> > > > > > > >> > > > > > segment might stay as an active segment up to
> > >> "max
> > >> > > > > > compaction
> > >> > > > > > > >> lag".
> > >> > > > > > > >> > > > (see
> > >> > > > > > > >> > > > > > the KIP for detail).
> > >> > > > > > > >> > > > > > 2) I measure how much bytes that we must
> clean
> > to
> > >> > > > follow the
> > >> > > > > > > >> "max
> > >> > > > > > > >> > > > > > compaction lag" rule, and use that to
> determine
> > >> the
> > >> > > > order of
> > >> > > > > > > >> > > > compaction.
> > >> > > > > > > >> > > > > > 3) force active segment to roll to follow the
> > >> "max
> > >> > > > > > compaction
> > >> > > > > > > >> lag"
> > >> > > > > > > >> > > > > >
> > >> > > > > > > >> > > > > > I can share my code so we can coordinate.
> > >> > > > > > > >> > > > > >
> > >> > > > > > > >> > > > > > I haven't think about a new API to force a
> > >> > compaction.
> > >> > > > what
> > >> > > > > > is
> > >> > > > > > > >> the
> > >> > > > > > > >> > > use
> > >> > > > > > > >> > > > > case
> > >> > > > > > > >> > > > > > for this one?
> > >> > > > > > > >> > > > > >
> > >> > > > > > > >> > > > > >
> > >> > > > > > > >> > > > > > On Wed, Aug 15, 2018 at 5:33 PM, Brett Rann
> > >> > > > > > > >> > > <brann@zendesk.com.invalid
> > >> > > > > > > >> > > > >
> > >> > > > > > > >> > > > > > wrote:
> > >> > > > > > > >> > > > > >
> > >> > > > > > > >> > > > > > > We've been looking into this too.
> > >> > > > > > > >> > > > > > >
> > >> > > > > > > >> > > > > > > Mailing list:
> > >> > > > > > > >> > > > > > > https://lists.apache.org/thread.html/
> > >> > > > > > > >> > > ed7f6a6589f94e8c2a705553f364ef
> > >> > > > > > > >> > > > > > > 599cb6915e4c3ba9b561e610e4@%
> > >> > 3Cdev.kafka.apache.org
> > >> > > %3E
> > >> > > > > > > >> > > > > > > jira wish:
> > >> > > > > > https://issues.apache.org/jira/browse/KAFKA-7137
> > >> > > > > > > >> > > > > > > confluent slack discussion:
> > >> > > > > > > >> > > > > > >
> > >> > > > https://confluentcommunity.slack.com/archives/C49R61XMM/
> > >> > > > > > > >> > > > > p1530760121000039
> > >> > > > > > > >> > > > > > >
> > >> > > > > > > >> > > > > > > A person on my team has started on code so
> > you
> > >> > might
> > >> > > > want
> > >> > > > > > to
> > >> > > > > > > >> > > > > coordinate:
> > >> > > > > > > >> > > > > > >
> > >> > > > https://github.com/dongxiaohe/kafka/tree/dongxiaohe/log-
> > >> > > > > > > >> > > > > > > cleaner-compaction-max-lifetime-2.0
> > >> > > > > > > >> > > > > > >
> > >> > > > > > > >> > > > > > > He's been working with Jason Gustafson and
> > >> James
> > >> > > Chen
> > >> > > > > > around
> > >> > > > > > > >> the
> > >> > > > > > > >> > > > > changes.
> > >> > > > > > > >> > > > > > > You can ping him on confluent slack as
> Xiaohe
> > >> > Dong.
> > >> > > > > > > >> > > > > > >
> > >> > > > > > > >> > > > > > > It's great to know others are thinking on
> it
> > as
> > >> > > well.
> > >> > > > > > > >> > > > > > >
> > >> > > > > > > >> > > > > > > You've added the requirement to force a
> > segment
> > >> > roll
> > >> > > > which
> > >> > > > > > > we
> > >> > > > > > > >> > > hadn't
> > >> > > > > > > >> > > > > gotten
> > >> > > > > > > >> > > > > > > to yet, which is great. I was content with
> it
> > >> not
> > >> > > > > > including
> > >> > > > > > > >> the
> > >> > > > > > > >> > > > active
> > >> > > > > > > >> > > > > > > segment.
> > >> > > > > > > >> > > > > > >
> > >> > > > > > > >> > > > > > > > Adding topic level configuration "
> > >> > > > max.compaction.lag.ms
> > >> > > > > > ",
> > >> > > > > > > >> and
> > >> > > > > > > >> > > > > > > corresponding broker configuration "
> > >> > > > > > > >> > log.cleaner.max.compaction.la
> > >> > > > > > > >> > > > g.ms
> > >> > > > > > > >> > > > > ",
> > >> > > > > > > >> > > > > > > which is set to 0 (disabled) by default.
> > >> > > > > > > >> > > > > > >
> > >> > > > > > > >> > > > > > > Glancing at some other settings convention
> > >> seems
> > >> > to
> > >> > > > me to
> > >> > > > > > be
> > >> > > > > > > >> -1
> > >> > > > > > > >> > for
> > >> > > > > > > >> > > > > > > disabled (or infinite, which is more
> > meaningful
> > >> > > > here). 0
> > >> > > > > > to
> > >> > > > > > > me
> > >> > > > > > > >> > > > implies
> > >> > > > > > > >> > > > > > > instant, a little quicker than 1.
> > >> > > > > > > >> > > > > > >
> > >> > > > > > > >> > > > > > > We've been trying to think about a way to
> > >> trigger
> > >> > > > > > compaction
> > >> > > > > > > >> as
> > >> > > > > > > >> > > well
> > >> > > > > > > >> > > > > > > through an API call, which would need to be
> > >> > flagged
> > >> > > > > > > somewhere
> (ZK admin/space?) but we're struggling to think how that would be
> coordinated across brokers and partitions. Have you given any thought to
> that?
>
> On Thu, Aug 16, 2018 at 8:44 AM xiongqi wu <xiongqiwu@gmail.com> wrote:
>
> > Eno, Dong,
> >
> > I have updated the KIP. We decided not to address the issue that we
> > might have for topics with both compaction and time retention enabled
> > (see rejected alternative item 2). This KIP will only ensure the log
> > can be compacted after a specified time interval.
> >
> > As suggested by Dong, we will also enforce that "max.compaction.lag.ms"
> > is not less than "min.compaction.lag.ms".
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-354 Time-based
> > log compaction policy
> >
> > On Tue, Aug 14, 2018 at 5:01 PM, xiongqi wu <xiongqiwu@gmail.com> wrote:
> >
> > > Per discussion with Dong, he made a very good point: if compaction
> > > and time-based retention are both enabled on a topic, the compaction
> > > might prevent records from being deleted on time. The reason is that
> > > when compacting multiple segments into one single segment, the newly
> > > created segment will have the same lastModified timestamp as the
> > > latest original segment. We lose the timestamps of all original
> > > segments except the last one. As a result, records might not be
> > > deleted as they should be through time-based retention.
> > >
> > > With the current KIP proposal, if we want to ensure timely deletion,
> > > we have the following configurations:
> > > 1) enable time-based log compaction only: deletion is done through
> > > overriding the same key
> > > 2) enable time-based log retention only: deletion is done through
> > > time-based retention
> > > 3) enable both log compaction and time-based retention: deletion is
> > > not guaranteed.
> > >
> > > Not sure if we have use case 3 and also want deletion to happen on
> > > time. There are several options to address the deletion issue when
> > > both compaction and retention are enabled:
> > > A) During log compaction, look into the record timestamp to delete
> > > expired records. This can be done in the compaction logic itself or
> > > via AdminClient.deleteRecords(). But this assumes we have record
> > > timestamps.
> > > B) Retain the lastModified time of the original segments during log
> > > compaction. This requires extra metadata to record the information,
> > > or not grouping multiple segments into one during compaction.
> > >
> > > If we have use case 3 in general, I would prefer solution A and rely
> > > on record timestamps.
> > >
> > > Two questions:
> > > Do we have use case 3? Is it nice to have or a must have?
> > > If we have use case 3 and want to go with solution A, should we
> > > introduce a new configuration to enforce deletion by timestamp?
> > >
> > > On Tue, Aug 14, 2018 at 1:52 PM, xiongqi wu <xiongqiwu@gmail.com> wrote:
> > >
> > > > Dong,
> > > >
> > > > Thanks for the comment.
> > > >
> > > > There are two retention policies: log compaction and time-based
> > > > retention.
> > > >
> > > > Log compaction:
> > > > We have use cases that keep infinite retention of a topic
> > > > (compaction only). GDPR cares about deletion of PII (personally
> > > > identifiable information) data. Since Kafka doesn't know which
> > > > records contain PII, it relies on the upper layer to delete those
> > > > records. For those infinite-retention uses, Kafka needs to provide
> > > > a way to enforce compaction on time. This is what we try to
> > > > address in this KIP.
> > > >
> > > > Time-based retention:
> > > > There are also use cases in which users of Kafka might want to
> > > > expire all their data. In those cases, they can use time-based
> > > > retention on their topics.
> > > >
> > > > Regarding your first question: if a user wants to delete a key in
> > > > a log-compacted topic, the user has to send a deletion using the
> > > > same key. Kafka only makes sure the deletion will happen within a
> > > > certain time period (like 2 days/7 days).
> > > >
> > > > Regarding your second question: in most cases, we might want to
> > > > delete all duplicated keys at the same time. Compaction might be
> > > > more efficient since we need to scan the log and find all
> > > > duplicates. However, the expected use case is to set the
> > > > time-based compaction interval on the order of days, and larger
> > > > than "min compaction lag". We don't want log compaction to happen
> > > > frequently since it is expensive. The purpose is to help
> > > > low-production-rate topics get compacted on time. For topics with
> > > > a "normal" incoming message rate, the "min dirty ratio" might
> > > > have triggered the compaction before this time-based compaction
> > > > policy takes effect.
> > > >
> > > > Eno,
> > > >
> > > > For your question: as I mentioned, we have long-retention use
> > > > cases for log-compacted topics, but we want to provide the
> > > > ability to delete certain PII records on time. Kafka itself
> > > > doesn't know whether a record contains sensitive information and
> > > > relies on the user for deletion.
> > > >
> > > > On Mon, Aug 13, 2018 at 6:58 PM, Dong Lin <lindong28@gmail.com> wrote:
> > > >
> > > > > Hey Xiongqi,
> > > > >
> > > > > Thanks for the KIP. I have two questions regarding the use-case
> > > > > for meeting the GDPR requirement.
> > > > >
> > > > > 1) If I recall correctly, one of the GDPR requirements is that
> > > > > we cannot keep messages longer than e.g. 30 days in storage
> > > > > (e.g. Kafka). Say there exists a partition p0 which contains
> > > > > message1 with key1 and message2 with key2. And then the user
> > > > > keeps producing messages with key=key2 to this partition. Since
> > > > > message1 with key1 is never overridden, sooner or later we will
> > > > > want to delete message1 and keep the latest message with
> > > > > key=key2. But currently it looks like the log compaction logic
> > > > > in Kafka will always put these messages in the same segment.
> > > > > Will this be an issue?
> > > > >
> > > > > 2) The current KIP intends to provide the capability to delete
> > > > > a given message in a log-compacted topic. Does such a use-case
> > > > > also require Kafka to keep the messages produced before the
> > > > > given message? If yes, then we can probably just use
> > > > > AdminClient.deleteRecords() or time-based log retention to meet
> > > > > the use-case requirement. If no, do you know what GDPR's
> > > > > requirement is on time-to-deletion after the user explicitly
> > > > > requests the deletion (e.g. 1 hour, 1 day, 7 days)?
> > > > >
> > > > > Thanks,
> > > > > Dong
> > > > >
> > > > > On Mon, Aug 13, 2018 at 3:44 PM, xiongqi wu <xiongqiwu@gmail.com> wrote:
> > > > >
> > > > > > Hi Eno,
> > > > > >
> > > > > > The GDPR request we are getting here at LinkedIn is: if we
> > > > > > get a request to delete a record through a null key on a
> > > > > > log-compacted topic, we want to delete the record via
> > > > > > compaction in a given time period, like 2 days (whatever is
> > > > > > required by the policy).
> > > > > >
> > > > > > There might be other issues (such as orphan log segments
> > > > > > under certain conditions) that lead to GDPR problems, but
> > > > > > they are more like something we need to fix anyway,
> > > > > > regardless of GDPR.
> > > > > >
> > > > > > -- Xiongqi (Wesley) Wu
> > > > > >
> > > > > > On Mon, Aug 13, 2018 at 2:56 PM, Eno Thereska <eno.thereska@gmail.com> wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > Thanks for the KIP. I'd like to see a more precise
> > > > > > > definition of what part of GDPR you are targeting, as well
> > > > > > > as some sort of verification that this KIP actually
> > > > > > > addresses the problem. Right now I find this a bit vague:
> > > > > > >
> > > > > > > "Ability to delete a log message through compaction in a
> > > > > > > timely manner has become an important requirement in some
> > > > > > > use cases (e.g., GDPR)"
> > > > > > >
> > > > > > > Is there any guarantee that after this KIP the GDPR
> > > > > > > problem is solved, or do we need to do something else as
> > > > > > > well, e.g., more KIPs?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Eno
> > > > > > >
> > > > > > > On Thu, Aug 9, 2018 at 4:18 PM, xiongqi wu <xiongqiwu@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi Kafka,
> > > > > > > >
> > > > > > > > This KIP tries to address the GDPR concern to fulfill
> > > > > > > > deletion requests on time through time-based log
> > > > > > > > compaction on a compaction-enabled topic:
> > > > > > > >
> > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-354%3A+Time-based+log+compaction+policy
> > > > > > > >
> > > > > > > > Any feedback will be appreciated.
> > > > > > > >
> > > > > > > > Xiongqi (Wesley) Wu
>
> --
> Brett Rann
> Senior DevOps Engineer
> Zendesk International Ltd
> 395 Collins Street, Melbourne VIC 3000 Australia
> Mobile: +61 (0) 418 826 017

--
-Regards,
Mayuresh R. Gharat
(862) 250-7125

Re: [DISCUSS] KIP-354 Time-based log compaction policy

Posted by Dong Lin <li...@gmail.com>.
Hey Xiongqi,

Thanks for the update. A few more comments below

1) According to the definition of
kafka.log:type=LogCleaner,name=max-compaction-delay, it seems that the
metric value will be a large negative number if max.compaction.lag.ms is
MAX_LONG. Would this be a problem? Also, it seems weird that the value of
the metric is defined w.r.t. how often the log cleaner is run.
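To make the concern concrete, here is a minimal Python sketch (illustrative names only, not actual Kafka code) of how a max-compaction-delay style value could be computed per log. Without clamping, setting max.compaction.lag.ms to MAX_LONG drives the raw value hugely negative, which is the problem raised above:

```python
MAX_LONG = 2**63 - 1  # Java Long.MAX_VALUE


def compaction_delay_ms(now_ms, earliest_segment_ts_ms, max_compaction_lag_ms):
    """Illustrative: how far past its compaction deadline a log is.

    The deadline for the earliest dirty segment is
    earliest_segment_ts_ms + max_compaction_lag_ms; anything past that
    counts as delay. Clamping at 0 avoids the huge negative values that
    arise when the lag is effectively disabled (MAX_LONG).
    """
    raw_delay = now_ms - earliest_segment_ts_ms - max_compaction_lag_ms
    return max(0, raw_delay)
```

With a 500 ms lag and a segment created at t=0, measuring at t=1000 gives a delay of 500 ms; with the lag set to MAX_LONG the clamped delay is 0 rather than a large negative number.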

2) Not sure if we need the metric num-logs-compacted-by-max-compaction-lag
in addition to max-compaction-delay. It seems that the operator can just use
max-compaction-delay to determine, in a quantitative manner, whether
max.compaction.lag is properly enforced. Also, the metric name
`num-logs-compacted-by-max-compaction-lag` is inconsistent with its
intended meaning, i.e. the number of logs that need to be compacted due to
max.compaction.lag but have not yet been compacted. So it is probably
simpler to just remove this metric.

3) The KIP currently says that "a message record has a guaranteed
upper-bound in time to become mandatory for compaction". The word
"guarantee" may be misleading, because the message may still not be
compacted within max.compaction.lag.ms of its creation. Could you clarify
the exact semantics of max.compaction.lag.ms in the Public Interface
section?
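As a hedged illustration of the semantic distinction being asked about (the names below are hypothetical, not from the KIP or the Kafka codebase): the bound makes a record *mandatory* for compaction once the lag elapses, but it does not mean the record has already been compacted at that moment:

```python
def must_compact(now_ms, earliest_record_ts_ms, max_compaction_lag_ms):
    # The record becomes *mandatory* for compaction once its age exceeds
    # the configured lag; the actual compaction still happens later,
    # whenever the cleaner picks this log up. So "mandatory" is a lower
    # bound on eligibility, not a guarantee of completed compaction.
    return now_ms - earliest_record_ts_ms > max_compaction_lag_ms
```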

4) The KIP's proposed change will estimate earliest message timestamp for
un-compacted log segments. Can you explain how broker determines whether a
segment has been compacted after the broker is restarted?

5) 2.b in the Proposed Change section provides two ways to get the
timestamp. To make the KIP easier to read for future reference, could we
mention only the method that we plan to use and move the other solution to
the Rejected Alternatives section?

6) Based on the discussion (i.e. point 2 in the previous email), it is said
that we can assume all messages have a timestamp and the feature added in
this KIP can be skipped for those messages which do not have one. So do we
still need to use "segment.largestTimestamp - maxSegmentMs" in
Proposed Change section 2.a?
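For reference, a small sketch of the estimation being discussed (hypothetical names; this is not the KIP's actual code): use the first batch's timestamp when available, and otherwise fall back to the segment.largestTimestamp - maxSegmentMs bound, which holds because a segment is rolled after at most maxSegmentMs:

```python
def estimate_earliest_timestamp_ms(segment_largest_ts_ms, max_segment_ms,
                                   first_batch_ts_ms=None):
    """Estimate the timestamp of the earliest message in a segment.

    If the timestamp of the first record batch is available, use it
    directly. Otherwise fall back to the conservative bound: a segment is
    rolled after at most max_segment_ms, so its earliest message cannot be
    older than segment_largest_ts_ms - max_segment_ms.
    """
    if first_batch_ts_ms is not None:
        return first_batch_ts_ms
    return segment_largest_ts_ms - max_segment_ms
```

For example, a segment whose largest timestamp is t=10,000 ms with a 2,000 ms roll interval has an estimated earliest timestamp of 8,000 ms, unless the first batch's timestamp (say 9,500 ms) is available.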

7) Based on the discussion (i.e. point 8 in the previous email), if this
KIP requires the user to monitor certain existing metrics for the
performance impact added in this KIP, can we list those metrics in the KIP
for the user's convenience?


Thanks,
Dong

On Mon, Oct 29, 2018 at 3:16 PM xiongqi wu <xi...@gmail.com> wrote:

> Hi Dong,
> I have updated the KIP to address your comments.
> One correction to previous Email:
> after offline discussion with Dong,  we decide to use MAX_LONG as default
> value for max.compaction.lag.ms.
>
>
> Xiongqi (Wesley) Wu
>
>
> On Mon, Oct 29, 2018 at 12:15 PM xiongqi wu <xi...@gmail.com> wrote:
>
> > Hi Dong,
> >
> > Thank you for your comment.  See my inline comments.
> > I will update the KIP shortly.
> >
> > Xiongqi (Wesley) Wu
> >
> >
> > On Sun, Oct 28, 2018 at 9:17 PM Dong Lin <li...@gmail.com> wrote:
> >
> >> Hey Xiongqi,
> >>
> >> Sorry for late reply. I have some comments below:
> >>
> >> 1) As discussed earlier in the email list, if the topic is configured
> with
> >> both deletion and compaction, in some cases messages produced a long
> time
> >> ago can not be deleted based on time. This is a valid use-case because
> we
> >> actually have topic which is configured with both deletion and
> compaction
> >> policy. And we should enforce the semantics for both policy. Solution A
> >> sounds good. We do not need interface change (e.g. extra config) to
> >> enforce
> >> solution A. All we need is to update implementation so that when broker
> >> compacts a topic, if the message has timestamp (which is the common
> case),
> >> messages that are too old (based on the time-based retention config)
> will
> >> be discarded. Since this is a valid issue and it is also related to the
> >> guarantee of when a message can be deleted, can we include the solution
> of
> >> this problem in the KIP?
> >>
> > ====== This makes sense. We can use a similar approach to increase the
> > log start offset.
> >
> >>
> >> 2) It is probably OK to assume that all messages have a timestamp. The
> >> per-message timestamp was introduced into Kafka 0.10.0 with KIP-31 and
> >> KIP-32 as of Feb 2016. Kafka 0.10.0 and earlier versions are no longer
> >> supported. Also, since the use-case for this feature is primarily GDPR,
> >> we can assume that the client library has already been upgraded to
> >> support SSL, a feature added after KIP-31 and KIP-32.
> >
> > =========> OK. We can use the message timestamp to delete expired
> > records if both compaction and retention are enabled.
> >
> >
> >> 3) In Proposed Change section 2.a, it is said that
> >> segment.largestTimestamp - maxSegmentMs can be used to determine the
> >> timestamp of the earliest message. Would it be simpler to just use the
> >> create time of the file to determine the time?
> >
> > ========> Linux/Java doesn't provide an API for file creation time,
> > because some filesystem types don't record it.
> >
> >
> >> 4) The KIP suggests using must-clean-ratio to select the partitions to
> >> be compacted. Unlike the dirty ratio, which is mostly for performance,
> >> the logs whose "must-clean-ratio" is non-zero must be compacted
> >> immediately for correctness reasons (and for GDPR). And if this cannot
> >> be achieved because e.g. broker compaction throughput is too low,
> >> investigation will be needed. So it seems simpler to first compact logs
> >> which have a segment whose earliest timestamp is earlier than now -
> >> max.compaction.lag.ms, instead of defining must-clean-ratio and sorting
> >> logs based on this value.
> >>
> >>
> > ======>  Good suggestion. This can simplify the implementation quite a
> > bit if we are not too concerned about compaction of a GDPR-required
> > partition being queued behind some large partition.  The actual
> > compaction completion time is not guaranteed anyway.
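The simpler selection rule agreed on here can be sketched like this. The helper below is illustrative (the names are not from Kafka's code); it just shows the predicate "earliest un-compacted timestamp older than now - max.compaction.lag.ms" replacing a must-clean-ratio sort:

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of the selection rule: a log partition is due for compaction
// as soon as the earliest timestamp among its un-compacted segments is older
// than now - max.compaction.lag.ms. No ratio computation or sorting needed.
public class CompactionSelectionSketch {
    static List<Integer> logsDueForCompaction(long[] earliestDirtyTimestampMs,
                                              long nowMs, long maxCompactionLagMs) {
        List<Integer> due = new ArrayList<>();
        for (int i = 0; i < earliestDirtyTimestampMs.length; i++) {
            if (nowMs - earliestDirtyTimestampMs[i] > maxCompactionLagMs) {
                due.add(i); // index of a log partition that must be compacted now
            }
        }
        return due;
    }
}
```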
> >
> >
> >> 5) The KIP says max.compaction.lag.ms is 0 by default and it is also
> >> suggested that 0 means disabled. Should we set this value to MAX_LONG by
> >> default to effectively disable the feature added in this KIP?
> >>
> >> ====> I would rather use 0 so the corresponding code path will not be
> > exercised.  By using MAX_LONG, we would theoretically go through related
> > code to find out whether the partition is required to be compacted to
> > satisfy MAX_LONG.
> >
> >> 6) It is probably cleaner and more readable not to include in the Public
> >> Interface section those configs whose meaning is not changed.
> >>
> >> ====> I will clean that up.
> >
> >> 7) The goal of this KIP is to ensure that a log segment whose earliest
> >> message is earlier than a given threshold will be compacted. This goal
> >> may not be achieved if the compaction throughput cannot catch up with
> >> the total bytes-in-rate for the compacted topics on the broker. Thus we
> >> need an easy way to tell the operator whether this goal is achieved. If
> >> we don't already have such a metric, maybe we can include metrics to
> >> show 1) the total number of log segments (or logs) which need to be
> >> immediately compacted as determined by max.compaction.lag; and 2) the
> >> maximum value of now - earliest_time_stamp_of_segment among all segments
> >> that need to be compacted.
> >>
> >> =======> good suggestion.  I will update KIP for these metrics.
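The two monitoring metrics proposed in point 7 can be sketched as below. This is a hypothetical computation over a list of earliest-timestamp values for segments subject to max.compaction.lag; the names are illustrative, not existing Kafka metrics:

```java
import java.util.List;

// Hedged sketch of the two suggested metrics:
// 1) how many segments are overdue for compaction, and
// 2) the maximum (now - earliest_time_stamp_of_segment) among those segments.
public class CompactionLagMetricsSketch {
    static long overdueSegmentCount(List<Long> earliestTimestampsMs, long nowMs, long maxLagMs) {
        return earliestTimestampsMs.stream()
                .filter(t -> nowMs - t > maxLagMs)
                .count();
    }

    static long maxCompactionDelayMs(List<Long> earliestTimestampsMs, long nowMs) {
        return earliestTimestampsMs.stream()
                .mapToLong(t -> nowMs - t)
                .max()
                .orElse(0L);
    }
}
```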
> >
> >> 8) The Performance Impact section suggests that users use the existing
> >> metrics to monitor the performance impact of this KIP. It is useful to
> >> list the meaning of each JMX metric that we want the user to monitor,
> >> and possibly explain how to interpret the value of these metrics to
> >> determine whether there is a performance issue.
> >>
> >> =========>  I will update the KIP.
> >
> >> Thanks,
> >> Dong
> >>
> >> On Tue, Oct 16, 2018 at 10:53 AM xiongqi wu <xi...@gmail.com>
> wrote:
> >>
> >> > Mayuresh,
> >> >
> >> > Thanks for the comments.
> >> > The requirement is that we need to pick up segments that are older
> than
> >> > maxCompactionLagMs for compaction.
> >> > maxCompactionLagMs is an upper bound, which implies that picking up
> >> > segments for compaction earlier doesn't violate the policy.
> >> > We use the creation time of a segment as an estimation of its records
> >> > arrival time, so these records can be compacted no later than
> >> > maxCompactionLagMs.
> >> >
> >> > On the other hand, compaction is an expensive operation, we don't want
> >> to
> >> > compact the log partition whenever a new segment is sealed.
> >> > Therefore, we want to pick up a segment for compaction when the
> >> > segment is close to the mandatory max compaction lag (so we use the
> >> > segment creation time as an estimation).
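The earliest-timestamp estimation described in this reply (and in the KIP's Proposed Change section) can be sketched as follows. This is an illustrative helper under the assumptions stated in the thread, not Kafka's actual implementation:

```java
// Hedged sketch of the estimation rule: use the first message's timestamp
// when available; otherwise fall back to
// segment.largestTimestamp - maxSegmentMs. The fallback may under-estimate,
// which only causes the segment to be picked up for compaction earlier --
// and that is safe under the max-lag policy.
public class EarliestTimestampEstimateSketch {
    static final long NO_TIMESTAMP = -1L; // sentinel for "message has no timestamp"

    static long estimateEarliestTimestamp(long firstMessageTimestampMs,
                                          long largestTimestampMs,
                                          long maxSegmentMs) {
        if (firstMessageTimestampMs != NO_TIMESTAMP) {
            return firstMessageTimestampMs;
        }
        return largestTimestampMs - maxSegmentMs;
    }
}
```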
> >> >
> >> >
> >> > Xiongqi (Wesley) Wu
> >> >
> >> >
> >> > On Mon, Oct 15, 2018 at 5:54 PM Mayuresh Gharat <
> >> > gharatmayuresh15@gmail.com>
> >> > wrote:
> >> >
> >> > > Hi Wesley,
> >> > >
> >> > > Thanks for the KIP and sorry for being late to the party.
> >> > >  I wanted to understand, the scenario you mentioned in Proposed
> >> changes :
> >> > >
> >> > > -
> >> > > >
> >> > > > Estimate the earliest message timestamp of an un-compacted log
> >> segment.
> >> > > we
> >> > > > only need to estimate earliest message timestamp for un-compacted
> >> log
> >> > > > segments to ensure timely compaction because the deletion requests
> >> that
> >> > > > belong to compacted segments have already been processed.
> >> > > >
> >> > > >    1.
> >> > > >
> >> > > >    for the first (earliest) log segment:  The estimated earliest
> >> > > >    timestamp is set to the timestamp of the first message if
> >> timestamp
> >> > is
> >> > > >    present in the message. Otherwise, the estimated earliest
> >> timestamp
> >> > > is set
> >> > > >    to "segment.largestTimestamp - maxSegmentMs”
> >> > > >     (segment.largestTimestamp is lastModified time of the log
> >> segment
> >> > or
> >> > > max
> >> > > >    timestamp we see for the log segment.). In the latter case, the
> >> > actual
> >> > > >    timestamp of the first message might be later than the
> >> estimation,
> >> > > but it
> >> > > >    is safe to pick up the log for compaction earlier.
> >> > > >
> >> > > > When we say "actual timestamp of the first message might be later
> >> than
> >> > > the
> >> > > estimation, but it is safe to pick up the log for compaction
> >> earlier.",
> >> > > doesn't that violate the assumption that we will consider a segment
> >> > > for compaction only if the time of creation of the segment has
> >> > > crossed the "now - maxCompactionLagMs" threshold?
> >> > >
> >> > > Thanks,
> >> > >
> >> > > Mayuresh
> >> > >
> >> > > On Mon, Sep 3, 2018 at 7:28 PM Brett Rann <brann@zendesk.com.invalid
> >
> >> > > wrote:
> >> > >
> >> > > > Might also be worth moving to a vote thread? Discussion seems to
> >> have
> >> > > gone
> >> > > > as far as it can.
> >> > > >
> >> > > > > On 4 Sep 2018, at 12:08, xiongqi wu <xi...@gmail.com>
> wrote:
> >> > > > >
> >> > > > > Brett,
> >> > > > >
> >> > > > > Yes, I will post PR tomorrow.
> >> > > > >
> >> > > > > Xiongqi (Wesley) Wu
> >> > > > >
> >> > > > >
> >> > > > > On Sun, Sep 2, 2018 at 6:28 PM Brett Rann
> >> <brann@zendesk.com.invalid
> >> > >
> >> > > > wrote:
> >> > > > >
> >> > > > > > +1 (non-binding) from me on the interface. I'd like to see
> >> someone
> >> > > > familiar
> >> > > > > > with
> >> > > > > > the code comment on the approach, and note there's a couple of
> >> > > > different
> >> > > > > > approaches: what's documented in the KIP, and what Xiaohe Dong
> >> was
> >> > > > working
> >> > > > > > on
> >> > > > > > here:
> >> > > > > >
> >> > > > > >
> >> > > >
> >> > >
> >> >
> >>
> https://github.com/dongxiaohe/kafka/tree/dongxiaohe/log-cleaner-compaction-max-lifetime-2.0
> >> > > > > >
> >> > > > > > If you have code working already Xiongqi Wu could you share a
> >> PR?
> >> > I'd
> >> > > > be
> >> > > > > > happy
> >> > > > > > to start testing.
> >> > > > > >
> >> > > > > > On Tue, Aug 28, 2018 at 5:57 AM xiongqi wu <
> xiongqiwu@gmail.com
> >> >
> >> > > > wrote:
> >> > > > > >
> >> > > > > > > Hi All,
> >> > > > > > >
> >> > > > > > > Do you have any additional comments on this KIP?
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > On Thu, Aug 16, 2018 at 9:17 PM, xiongqi wu <
> >> xiongqiwu@gmail.com
> >> > >
> >> > > > wrote:
> >> > > > > > >
> >> > > > > > > > on 2)
> >> > > > > > > > The offset map is built starting from the first dirty segment.
> >> > > > > > > > The compaction starts from the beginning of the log
> >> > > > > > > > partition. That's how
> >> > > > > > > > it ensures the deletion of tombstoned keys.
> >> > > > > > > > I will double check tomorrow.
> >> > > > > > > >
> >> > > > > > > > Xiongqi (Wesley) Wu
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > On Thu, Aug 16, 2018 at 6:46 PM Brett Rann
> >> > > > <br...@zendesk.com.invalid>
> >> > > > > > > > wrote:
> >> > > > > > > >
> >> > > > > > > >> To just clarify a bit on 1. whether there's an external
> >> > > storage/DB
> >> > > > > > isn't
> >> > > > > > > >> relevant here.
> >> > > > > > > >> Compacted topics allow a tombstone record to be sent (a
> >> null
> >> > > value
> >> > > > > > for a
> >> > > > > > > >> key) which
> >> > > > > > > >> currently will result in old values for that key being
> >> deleted
> >> > > if
> >> > > > some
> >> > > > > > > >> conditions are met.
> >> > > > > > > >> There are existing controls to make sure the old values
> >> will
> >> > > stay
> >> > > > > > around
> >> > > > > > > >> for a minimum
> >> > > > > > > >> time at least, but no dedicated control to ensure the
> >> > tombstone
> >> > > > will
> >> > > > > > > >> delete
> >> > > > > > > >> within a
> >> > > > > > > >> maximum time.
> >> > > > > > > >>
> >> > > > > > > >> One popular reason that maximum time for deletion is
> >> desirable
> >> > > > right
> >> > > > > > now
> >> > > > > > > >> is
> >> > > > > > > >> GDPR with
> >> > > > > > > >> PII. But we're not proposing any GDPR awareness in kafka,
> >> just
> >> > > > being
> >> > > > > > > able
> >> > > > > > > >> to guarantee
> >> > > > > > > >> a max time where a tombstoned key will be removed from
> the
> >> > > > compacted
> >> > > > > > > >> topic.
> >> > > > > > > >>
> >> > > > > > > >> on 2)
> >> > > > > > > >> huh, i thought it kept track of the first dirty segment
> and
> >> > > didn't
> >> > > > > > > >> recompact older "clean" ones.
> >> > > > > > > >> But I didn't look at code or test for that.
> >> > > > > > > >>
> >> > > > > > > >> On Fri, Aug 17, 2018 at 10:57 AM xiongqi wu <
> >> > > xiongqiwu@gmail.com>
> >> > > > > > > wrote:
> >> > > > > > > >>
> >> > > > > > > >> > 1, The owner of the data (in this sense, Kafka is not the
> >> > > > > > > >> > owner of the data)
> >> > > > > > > >> > should keep track of lifecycle of the data in some
> >> external
> >> > > > > > > storage/DB.
> >> > > > > > > >> > The owner determines when to delete the data and send
> the
> >> > > delete
> >> > > > > > > >> request to
> >> > > > > > > >> > kafka. Kafka doesn't know about the content of the data
> >> > > > > > > >> > but only provides a means for deletion.
> >> > > > > > > >> >
> >> > > > > > > >> > 2, each time compaction runs, it will start from the first
> >> > > > > > > >> > segment (no
> >> > > > > > > >> > matter whether it is compacted or not). The time estimation
> >> > > > > > > >> > here is only
> >> > > > > > > used
> >> > > > > > > >> > to determine whether we should run compaction on this
> log
> >> > > > partition.
> >> > > > > > > So
> >> > > > > > > >> we
> >> > > > > > > >> > only need to estimate uncompacted segments.
> >> > > > > > > >> >
> >> > > > > > > >> > On Thu, Aug 16, 2018 at 5:35 PM, Dong Lin <
> >> > > lindong28@gmail.com>
> >> > > > > > > wrote:
> >> > > > > > > >> >
> >> > > > > > > >> > > Hey Xiongqi,
> >> > > > > > > >> > >
> >> > > > > > > >> > > Thanks for the update. I have two questions for the
> >> latest
> >> > > > KIP.
> >> > > > > > > >> > >
> >> > > > > > > >> > > 1) The motivation section says that one use case is
> to
> >> > > delete
> >> > > > PII
> >> > > > > > > >> > (Personal
> >> > > > > > > >> > > Identifiable information) data within 7 days while
> >> keeping
> >> > > > non-PII
> >> > > > > > > >> > > indefinitely in compacted format. I suppose the
> >> use-case
> >> > > > depends
> >> > > > > > on
> >> > > > > > > >> the
> >> > > > > > > >> > > application to determine when to delete those PII
> data.
> >> > > Could
> >> > > > you
> >> > > > > > > >> explain
> >> > > > > > > >> > > how can application reliably determine the set of
> keys
> >> > that
> >> > > > should
> >> > > > > > > be
> >> > > > > > > >> > > deleted? Is the application required to always consume
> >> > > > > > > >> > > messages from the topic after
> >> > > > > > > >> > > every restart and determine the keys to be deleted by
> >> > > looking
> >> > > > at
> >> > > > > > > >> message
> >> > > > > > > >> > > timestamp, or is application supposed to persist the
> >> key->
> >> > > > > > timstamp
> >> > > > > > > >> > > information in a separate persistent storage system?
> >> > > > > > > >> > >
> >> > > > > > > >> > > 2) It is mentioned in the KIP that "we only need to
> >> > estimate
> >> > > > > > > earliest
> >> > > > > > > >> > > message timestamp for un-compacted log segments
> because
> >> > the
> >> > > > > > deletion
> >> > > > > > > >> > > requests that belong to compacted segments have
> already
> >> > been
> >> > > > > > > >> processed".
> >> > > > > > > >> > > Not sure if it is correct. If a segment is compacted
> >> > before
> >> > > > user
> >> > > > > > > sends
> >> > > > > > > >> > > message to delete a key in this segment, it seems
> that
> >> we
> >> > > > still
> >> > > > > > need
> >> > > > > > > >> to
> >> > > > > > > >> > > ensure that the segment will be compacted again
> within
> >> the
> >> > > > given
> >> > > > > > > time
> >> > > > > > > >> > after
> >> > > > > > > >> > > the deletion is requested, right?
> >> > > > > > > >> > >
> >> > > > > > > >> > > Thanks,
> >> > > > > > > >> > > Dong
> >> > > > > > > >> > >
> >> > > > > > > >> > > On Thu, Aug 16, 2018 at 10:27 AM, xiongqi wu <
> >> > > > xiongqiwu@gmail.com
> >> > > > > > >
> >> > > > > > > >> > wrote:
> >> > > > > > > >> > >
> >> > > > > > > >> > > > Hi Xiaohe,
> >> > > > > > > >> > > >
> >> > > > > > > >> > > > Quick note:
> >> > > > > > > >> > > > 1) Use the minimum of segment.ms and
> >> > > > > > > >> > > > max.compaction.lag.ms.
> >> > > > > > > >> > > >
> >> > > > > > > >> > > > 2) I am not sure if I get your second question.
> >> first,
> >> > we
> >> > > > have
> >> > > > > > > >> jitter
> >> > > > > > > >> > > when
> >> > > > > > > >> > > > we roll the active segment. Second, on each compaction,
> >> > > > > > > >> > > > we compact up to what the offset map allows. Those will
> >> > > > > > > >> > > > not lead to a perfect storm of compactions
> >> > > > > > > >> > > > over time. In addition, I expect we are setting
> >> > > > > > > >> max.compaction.lag.ms
> >> > > > > > > >> > on
> >> > > > > > > >> > > > the order of days.
> >> > > > > > > >> > > >
> >> > > > > > > >> > > > 3) I don't have access to the confluent community
> >> slack
> >> > > for
> >> > > > > > now. I
> >> > > > > > > >> am
> >> > > > > > > >> > > > reachable via the google handle out.
> >> > > > > > > >> > > > To avoid the double effort, here is my plan:
> >> > > > > > > >> > > > a) Collect more feedback and feature requirements on
> >> > > > > > > >> > > > the KIP.
> >> > > > > > > >> > > > b) Wait until this KIP is approved.
> >> > > > > > > >> > > > c) I will address any additional requirements in
> the
> >> > > > > > > implementation.
> >> > > > > > > >> > (My
> >> > > > > > > >> > > > current implementation only complies to whatever
> >> > described
> >> > > > in
> >> > > > > > the
> >> > > > > > > >> KIP
> >> > > > > > > >> > > now)
> >> > > > > > > >> > > > d) I can share the code with you and the community to
> >> > > > > > > >> > > > see if you want to add anything.
> >> > > > > > > >> > > > e) submission through committee
> >> > > > > > > >> > > >
> >> > > > > > > >> > > >
> >> > > > > > > >> > > > On Wed, Aug 15, 2018 at 11:42 PM, XIAOHE DONG <
> >> > > > > > > >> dannyrivclo@gmail.com>
> >> > > > > > > >> > > > wrote:
> >> > > > > > > >> > > >
> >> > > > > > > >> > > > > Hi Xiongqi
> >> > > > > > > >> > > > >
> >> > > > > > > >> > > > > Thanks for thinking about implementing this as
> >> well.
> >> > :)
> >> > > > > > > >> > > > >
> >> > > > > > > >> > > > > I was thinking about using `segment.ms` to
> trigger
> >> > the
> >> > > > > > segment
> >> > > > > > > >> roll.
> >> > > > > > > >> > > > > Also, its value can be the largest time bias for
> >> the
> >> > > > record
> >> > > > > > > >> deletion.
> >> > > > > > > >> > > For
> >> > > > > > > >> > > > > example, if the `segment.ms` is 1 day and `
> >> > > > max.compaction.ms`
> >> > > > > > > is
> >> > > > > > > >> 30
> >> > > > > > > >> > > > days,
> >> > > > > > > >> > > > > the compaction may happen around 31 days.
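The worst-case bound Xiaohe describes (1 day of `segment.ms` plus 30 days of max compaction lag giving roughly 31 days) can be written out explicitly. A hedged sketch of that arithmetic, with illustrative names:

```java
// Hedged sketch of the worst-case delay reasoning above: a record written just
// after a segment opens can sit in the active segment for up to segment.ms
// before the max-compaction-lag clock effectively starts for that segment, so
// the effective upper bound on deletion delay is roughly the sum of the two.
public class CompactionDelayBoundSketch {
    static long worstCaseCompactionDelayMs(long segmentMs, long maxCompactionLagMs) {
        return segmentMs + maxCompactionLagMs;
    }

    public static void main(String[] args) {
        long oneDayMs = 24L * 60 * 60 * 1000;
        // segment.ms = 1 day, max lag = 30 days -> bound of ~31 days
        System.out.println(worstCaseCompactionDelayMs(oneDayMs, 30 * oneDayMs) / oneDayMs);
    }
}
```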
> >> > > > > > > >> > > > >
> >> > > > > > > >> > > > > For my curiosity, is there a way we can do some
> >> > > > performance
> >> > > > > > test
> >> > > > > > > >> for
> >> > > > > > > >> > > this
> >> > > > > > > >> > > > > and any tools you can recommend. As you know,
> >> > > > > > > >> > > > > previously it was cleaned up
> >> > > > > > > >> > > > > by respecting the dirty ratio, but now it may happen
> >> > > > > > > >> > > > > anytime once the max lag has
> >> > > > > > > >> > > > > passed for each message. I wonder what would
> >> happen if
> >> > > > clients
> >> > > > > > > >> send
> >> > > > > > > >> > > huge
> >> > > > > > > >> > > > > amount of tombstone records at the same time.
> >> > > > > > > >> > > > >
> >> > > > > > > >> > > > > I am looking forward to have a quick chat with
> you
> >> to
> >> > > > avoid
> >> > > > > > > double
> >> > > > > > > >> > > effort
> >> > > > > > > >> > > > > on this. I am in confluent community slack during
> >> the
> >> > > work
> >> > > > > > time.
> >> > > > > > > >> My
> >> > > > > > > >> > > name
> >> > > > > > > >> > > > is
> >> > > > > > > >> > > > > Xiaohe Dong. :)
> >> > > > > > > >> > > > >
> >> > > > > > > >> > > > > Rgds
> >> > > > > > > >> > > > > Xiaohe Dong
> >> > > > > > > >> > > > >
> >> > > > > > > >> > > > >
> >> > > > > > > >> > > > >
> >> > > > > > > >> > > > > On 2018/08/16 01:22:22, xiongqi wu <
> >> > xiongqiwu@gmail.com
> >> > > >
> >> > > > > > wrote:
> >> > > > > > > >> > > > > > Brett,
> >> > > > > > > >> > > > > >
> >> > > > > > > >> > > > > > Thank you for your comments.
> >> > > > > > > >> > > > > > I was thinking that since we already have an
> >> > > > > > > >> > > > > > immediate-compaction setting (by setting
> >> > > > > > > >> > > > > > the min dirty ratio to 0), I decided to use "0" as
> >> > > > > > > >> > > > > > the disabled state.
> >> > > > > > > >> > > > > > I am ok to go with -1(disable), 0 (immediate)
> >> > options.
> >> > > > > > > >> > > > > >
> >> > > > > > > >> > > > > > For the implementation, there are a few
> >> differences
> >> > > > between
> >> > > > > > > mine
> >> > > > > > > >> > and
> >> > > > > > > >> > > > > > "Xiaohe Dong"'s :
> >> > > > > > > >> > > > > > 1) I used the estimated creation time of a log
> >> > segment
> >> > > > > > instead
> >> > > > > > > >> of
> >> > > > > > > >> > > > largest
> >> > > > > > > >> > > > > > timestamp of a log to determine the compaction
> >> > > > eligibility,
> >> > > > > > > >> > because a
> >> > > > > > > >> > > > log
> >> > > > > > > >> > > > > > segment might stay as an active segment up to
> >> "max
> >> > > > > > compaction
> >> > > > > > > >> lag".
> >> > > > > > > >> > > > (see
> >> > > > > > > >> > > > > > the KIP for detail).
> >> > > > > > > >> > > > > > 2) I measure how many bytes we must clean to
> >> > > > > > > >> > > > > > follow the
> >> > > > > > > >> "max
> >> > > > > > > >> > > > > > compaction lag" rule, and use that to determine
> >> the
> >> > > > order of
> >> > > > > > > >> > > > compaction.
> >> > > > > > > >> > > > > > 3) force active segment to roll to follow the
> >> "max
> >> > > > > > compaction
> >> > > > > > > >> lag"
> >> > > > > > > >> > > > > >
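Item 3 above (forcing the active segment to roll) can be sketched with a simple age check. This is illustrative only; the real roll decision in the broker involves more conditions (size, offset range, etc.) than shown here:

```java
// Hedged sketch of item 3: force the active segment to roll once its estimated
// creation time is older than now - maxCompactionLagMs, so that the records it
// holds become eligible for compaction within the promised lag.
public class ActiveSegmentRollSketch {
    static boolean shouldForceRoll(long segmentCreatedMs, long nowMs, long maxCompactionLagMs) {
        return nowMs - segmentCreatedMs > maxCompactionLagMs;
    }
}
```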
> >> > > > > > > >> > > > > > I can share my code so we can coordinate.
> >> > > > > > > >> > > > > >
> >> > > > > > > >> > > > > > I haven't thought about a new API to force a
> >> > > > > > > >> > > > > > compaction. What
> >> > > > > > is
> >> > > > > > > >> the
> >> > > > > > > >> > > use
> >> > > > > > > >> > > > > case
> >> > > > > > > >> > > > > > for this one?
> >> > > > > > > >> > > > > >
> >> > > > > > > >> > > > > >
> >> > > > > > > >> > > > > > On Wed, Aug 15, 2018 at 5:33 PM, Brett Rann
> >> > > > > > > >> > > <brann@zendesk.com.invalid
> >> > > > > > > >> > > > >
> >> > > > > > > >> > > > > > wrote:
> >> > > > > > > >> > > > > >
> >> > > > > > > >> > > > > > > We've been looking into this too.
> >> > > > > > > >> > > > > > >
> >> > > > > > > >> > > > > > > Mailing list:
> >> > > > > > > >> > > > > > > https://lists.apache.org/thread.html/ed7f6a6589f94e8c2a705553f364ef599cb6915e4c3ba9b561e610e4@%3Cdev.kafka.apache.org%3E
> >> > > > > > > >> > > > > > > jira wish:
> >> > > > > > > >> > > > > > > https://issues.apache.org/jira/browse/KAFKA-7137
> >> > > > > > > >> > > > > > > confluent slack discussion:
> >> > > > > > > >> > > > > > > https://confluentcommunity.slack.com/archives/C49R61XMM/p1530760121000039
> >> > > > > > > >> > > > > > >
> >> > > > > > > >> > > > > > > A person on my team has started on code so
> you
> >> > might
> >> > > > want
> >> > > > > > to
> >> > > > > > > >> > > > > coordinate:
> >> > > > > > > >> > > > > > >
> >> > > > > > > >> > > > > > > https://github.com/dongxiaohe/kafka/tree/dongxiaohe/log-cleaner-compaction-max-lifetime-2.0
> >> > > > > > > >> > > > > > >
> >> > > > > > > >> > > > > > > He's been working with Jason Gustafson and
> >> James
> >> > > Chen
> >> > > > > > around
> >> > > > > > > >> the
> >> > > > > > > >> > > > > changes.
> >> > > > > > > >> > > > > > > You can ping him on confluent slack as Xiaohe
> >> > Dong.
> >> > > > > > > >> > > > > > >
> >> > > > > > > >> > > > > > > It's great to know others are thinking on it
> as
> >> > > well.
> >> > > > > > > >> > > > > > >
> >> > > > > > > >> > > > > > > You've added the requirement to force a
> segment
> >> > roll
> >> > > > which
> >> > > > > > > we
> >> > > > > > > >> > > hadn't
> >> > > > > > > >> > > > > gotten
> >> > > > > > > >> > > > > > > to yet, which is great. I was content with it
> >> not
> >> > > > > > including
> >> > > > > > > >> the
> >> > > > > > > >> > > > active
> >> > > > > > > >> > > > > > > segment.
> >> > > > > > > >> > > > > > >
> >> > > > > > > >> > > > > > > > Adding topic level configuration "
> >> > > > max.compaction.lag.ms
> >> > > > > > ",
> >> > > > > > > >> and
> >> > > > > > > >> > > > > > > corresponding broker configuration "
> >> > > > > > > >> > log.cleaner.max.compaction.la
> >> > > > > > > >> > > > g.ms
> >> > > > > > > >> > > > > ",
> >> > > > > > > >> > > > > > > which is set to 0 (disabled) by default.
> >> > > > > > > >> > > > > > >
> >> > > > > > > >> > > > > > > Glancing at some other settings convention
> >> seems
> >> > to
> >> > > > me to
> >> > > > > > be
> >> > > > > > > >> -1
> >> > > > > > > >> > for
> >> > > > > > > >> > > > > > > disabled (or infinite, which is more
> meaningful
> >> > > > here). 0
> >> > > > > > to
> >> > > > > > > me
> >> > > > > > > >> > > > implies
> >> > > > > > > >> > > > > > > instant, a little quicker than 1.
> >> > > > > > > >> > > > > > >
> >> > > > > > > >> > > > > > > We've been trying to think about a way to
> >> trigger
> >> > > > > > compaction
> >> > > > > > > >> as
> >> > > > > > > >> > > well
> >> > > > > > > >> > > > > > > through an API call, which would need to be
> >> > flagged
> >> > > > > > > somewhere
> >> > > > > > > >> (ZK
> >> > > > > > > >> > > > > admin/
> >> > > > > > > >> > > > > > > space?) but we're struggling to think how
> that
> >> > would
> >> > > > be
> >> > > > > > > >> > coordinated
> >> > > > > > > >> > > > > across
> >> > > > > > > >> > > > > > > brokers and partitions. Have you given any
> >> thought
> >> > > to
> >> > > > > > that?
> >> > > > > > > >> > > > > > >
> >> > > > > > > >> > > > > > >
> >> > > > > > > >> > > > > > >
> >> > > > > > > >> > > > > > >
> >> > > > > > > >> > > > > > >
> >> > > > > > > >> > > > > > >
> >> > > > > > > >> > > > > > > On Thu, Aug 16, 2018 at 8:44 AM xiongqi wu <
> >> > > > > > > >> xiongqiwu@gmail.com>
> >> > > > > > > >> > > > > wrote:
> >> > > > > > > >> > > > > > >
> >> > > > > > > >> > > > > > > > Eno, Dong,
> >> > > > > > > >> > > > > > > >
> >> > > > > > > >> > > > > > > > I have updated the KIP. We decided not to
> >> > > > > > > >> > > > > > > > address the issue that we might
> >> > > > > > > >> > > > > > > > have for both compaction and time retention
> >> > > enabled
> >> > > > > > topics
> >> > > > > > > >> (see
> >> > > > > > > >> > > the
> >> > > > > > > >> > > > > > > > rejected alternative item 2). This KIP will
> >> only
> >> > > > ensure
> >> > > > > > > log
> >> > > > > > > >> can
> >> > > > > > > >> > > be
> >> > > > > > > >> > > > > > > > compacted after a specified time-interval.
> >> > > > > > > >> > > > > > > >
> >> > > > > > > >> > > > > > > > As suggested by Dong, we will also enforce
> "
> >> > > > > > > >> > > max.compaction.lag.ms"
> >> > > > > > > >> > > > > is
> >> > > > > > > >> > > > > > > not
> >> > > > > > > >> > > > > > > > less than "min.compaction.lag.ms".
> >> > > > > > > >> > > > > > > >
> >> > > > > > > >> > > > > > > >
> >> > > > > > > >> > > > > > > > KIP-354: Time-based log compaction policy
> >> > > > > > > >> > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-354
> >> > > > > > > >> > > > > > > >
> >> > > > > > > >> > > > > > > >
> >> > > > > > > >> > > > > > > > On Tue, Aug 14, 2018 at 5:01 PM, xiongqi
> wu <
> >> > > > > > > >> > xiongqiwu@gmail.com
> >> > > > > > > >> > > >
> >> > > > > > > >> > > > > wrote:
> >> > > > > > > >> > > > > > > >
> >> > > > > > > >> > > > > > > > >
> >> > > > > > > >> > > > > > > > > Per discussion with Dong, he made a very
> >> good
> >> > > > point
> >> > > > > > that
> >> > > > > > > >> if
> >> > > > > > > >> > > > > compaction
> >> > > > > > > >> > > > > > > > > and time based retention are both enabled
> >> on a
> >> > > > topic,
> >> > > > > > > the
> >> > > > > > > >> > > > > compaction
> >> > > > > > > >> > > > > > > > might
> >> > > > > > > >> > > > > > > > > prevent records from being deleted on
> time.
> >> > The
> >> > > > reason
> >> > > > > > > is
> >> > > > > > > >> > when
> >> > > > > > > >> > > > > > > compacting
> >> > > > > > > >> > > > > > > > > multiple segments into one single
> segment,
> >> the
> >> > > > newly
> >> > > > > > > >> created
> >> > > > > > > >> > > > > segment
> >> > > > > > > >> > > > > > > will
> >> > > > > > > >> > > > > > > > > have same lastmodified timestamp as
> latest
> >> > > > original
> >> > > > > > > >> segment.
> >> > > > > > > >> > We
> >> > > > > > > >> > > > > lose
> >> > > > > > > >> > > > > > > the
> >> > > > > > > >> > > > > > > > > timestamp of all original segments except
> >> the
> >> > > last
> >> > > > > > one.
> >> > > > > > > >> As a
> >> > > > > > > >> > > > > result,
> >> > > > > > > >> > > > > > > > > records might not be deleted as they should
> >> > > > > > > >> > > > > > > > > be through time-based retention.
> >> > > > > > > >> > > > > > > > >
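The loss of timestamps described above can be illustrated with a tiny sketch: after compaction merges several segments, only the newest original lastModified time survives, so a time-based retention check on the merged segment can no longer see that some of its records came from much older segments. The helper name is illustrative:

```java
import java.util.Collections;
import java.util.List;

// Hedged illustration of the problem: the merged segment inherits the newest
// original lastModified timestamp, hiding the age of records that came from
// the older segments it absorbed.
public class MergedSegmentTimestampSketch {
    static long mergedLastModifiedMs(List<Long> originalLastModifiedMs) {
        return Collections.max(originalLastModifiedMs);
    }
}
```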
> >> > > > > > > >> > > > > > > > > With the current KIP proposal, if we want
> >> to
> >> > > > ensure
> >> > > > > > > timely
> >> > > > > > > >> > > > > deletion, we
> >> > > > > > > >> > > > > > > > > have the following configurations:
> >> > > > > > > >> > > > > > > > > 1) enable time based log compaction only
> :
> >> > > > deletion is
> >> > > > > > > >> done
> >> > > > > > > >> > > > though
> >> > > > > > > >> > > > > > > > > overriding the same key
> >> > > > > > > >> > > > > > > > > 2) enable time based log retention only:
> >> > > deletion
> >> > > > is
> >> > > > > > > done
> >> > > > > > > >> > > though
> >> > > > > > > >> > > > > > > > > time-based retention
> >> > > > > > > >> > > > > > > > > 3) enable both log compaction and time
> >> based
> >> > > > > > retention:
> >> > > > > > > >> > > Deletion
> >> > > > > > > >> > > > > is not
> >> > > > > > > >> > > > > > > > > guaranteed.
> >> > > > > > > >> > > > > > > > >
> >> > > > > > > >> > > > > > > > > Not sure if we have use case 3 and also
> >> want
> >> > > > deletion
> >> > > > > > to
> >> > > > > > > >> > happen
> >> > > > > > > >> > > > on
> >> > > > > > > >> > > > > > > time.
> >> > > > > > > >> > > > > > > > > There are several options to address
> >> deletion
> >> > > > issue
> >> > > > > > when
> >> > > > > > > >> > enable
> >> > > > > > > >> > > > > both
> >> > > > > > > >> > > > > > > > > compaction and retention:
> >> > > > > > > >> > > > > > > > > A) During log compaction, looking into
> >> record
> >> > > > > > timestamp
> >> > > > > > > to
> >> > > > > > > >> > > delete
> >> > > > > > > >> > > > > > > expired
> >> > > > > > > >> > > > > > > > > records. This can be done in compaction
> >> logic
> >> > > > itself
> >> > > > > > or
> >> > > > > > > >> use
> >> > > > > > > >> > > > > > > > > AdminClient.deleteRecords(). But this assumes
> >> > > > > > > >> > > > > > > > > we have record timestamps.
> >> > > > > > > >> > > > > > > > > B) retain the lastModified time of
> >> > > > > > > >> > > > > > > > > original
> >> > > > segments
> >> > > > > > > during
> >> > > > > > > >> > log
> >> > > > > > > >> > > > > > > > compaction.
> >> > > > > > > >> > > > > > > > > This requires extra meta data to record
> the
> >> > > > > > information
> >> > > > > > > or
> >> > > > > > > >> > not
> >> > > > > > > >> > > > > grouping
> >> > > > > > > >> > > > > > > > > multiple segments into one during
> >> compaction.
> >> > > > > > > >> > > > > > > > >
> >> > > > > > > >> > > > > > > > > If we have use case 3 in general, I would
> >> > prefer
> >> > > > > > > solution
> >> > > > > > > >> A
> >> > > > > > > >> > and
> >> > > > > > > >> > > > > rely on
> >> > > > > > > >> > > > > > > > > record timestamp.
> >> > > > > > > >> > > > > > > > >
> >> > > > > > > >> > > > > > > > >
> >> > > > > > > >> > > > > > > > > Two questions:
> >> > > > > > > >> > > > > > > > > Do we have use case 3? Is it nice to have
> >> or
> >> > > must
> >> > > > > > have?
> >> > > > > > > >> > > > > > > > > If we have use case 3 and want to go with
> >> > > > solution A,
> >> > > > > > > >> should
> >> > > > > > > >> > we
> >> > > > > > > >> > > > > > > introduce
> >> > > > > > > >> > > > > > > > > a new configuration to enforce deletion
> by
> >> > > > timestamp?
> >> > > > > > > >> > > > > > > > >
> >> > > > > > > >> > > > > > > > >
> On Tue, Aug 14, 2018 at 1:52 PM, xiongqi wu <xiongqiwu@gmail.com> wrote:
> 
>> Dong,
>> 
>> Thanks for the comment.
>> 
>> There are two retention policies: log compaction and time-based retention.
>> 
>> Log compaction:
>> 
>> We have use cases to keep infinite retention of a topic (only compaction).
>> GDPR cares about deletion of PII (personally identifiable information)
>> data. Since Kafka doesn't know what records contain PII, it relies on the
>> upper layer to delete those records. For those infinite retention use
>> cases, Kafka needs to provide a way to enforce compaction on time. This is
>> what we try to address in this KIP.
>> 
>> Time-based retention:
>> 
>> There are also use cases where users of Kafka might want to expire all
>> their data. In those cases, they can use time-based retention on their
>> topics.
>> 
>> Regarding your first question: if a user wants to delete a key in a log
>> compacted topic, the user has to send a deletion using the same key. Kafka
>> only makes sure the deletion will happen within a certain time period
>> (like 2 days/7 days).
>> 
>> Regarding your second question: in most cases, we might want to delete all
>> duplicated keys at the same time. Compaction might be more efficient since
>> we need to scan the log and find all duplicates. However, the expected use
>> case is to set the time-based compaction interval on the order of days,
>> and be larger than "min compaction lag". We don't want log compaction to
>> happen frequently since it is expensive. The purpose is to help a
>> low-production-rate topic get compacted on time. For a topic with a
>> "normal" incoming message rate, the "min dirty ratio" might have triggered
>> the compaction before this time-based compaction policy takes effect.
>> 
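The deletion-by-key semantics discussed in this thread can be illustrated with a toy model of compaction (this is not the broker's cleaner code, just the key-to-latest-value behavior that tombstones rely on):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of log compaction: the cleaner keeps only the latest value per
// key, and a record with a null value (a tombstone) removes the key entirely
// once the cleaner processes it.
public class CompactionModel {

    // records[i] = {key, value}; a null value is a tombstone for that key.
    public static Map<String, String> compact(String[][] records) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (String[] record : records) {
            String key = record[0];
            String value = record[1];
            if (value == null) {
                latest.remove(key); // tombstone: all prior values for key go away
            } else {
                latest.put(key, value);
            }
        }
        return latest;
    }
}
```

Producing ("k1","v1"), ("k2","v2"), then a tombstone ("k1", null) leaves only k2 after compaction; the KIP is about bounding how long that last step can take.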
>> Eno,
>> 
>> For your question: as I mentioned, we have long-time-retention use cases
>> for log compacted topics, but we want to provide the ability to delete
>> certain PII records on time. Kafka itself doesn't know whether a record
>> contains sensitive information and relies on the user for deletion.
>> 
>> On Mon, Aug 13, 2018 at 6:58 PM, Dong Lin <lindong28@gmail.com> wrote:
>> 
>>> Hey Xiongqi,
>>> 
>>> Thanks for the KIP. I have two questions regarding the use-case for
>>> meeting the GDPR requirement.
>>> 
>>> 1) If I recall correctly, one of the GDPR requirements is that we cannot
>>> keep messages longer than e.g. 30 days in storage (e.g. Kafka). Say there
>>> exists a partition p0 which contains message1 with key1 and message2 with
>>> key2. And then the user keeps producing messages with key=key2 to this
>>> partition. Since message1 with key1 is never overridden, sooner or later
>>> we will want to delete message1 and keep the latest message with
>>> key=key2. But currently it looks like the log compaction logic in Kafka
>>> will always put these messages in the same segment. Will this be an
>>> issue?
>>> 
>>> 2) The current KIP intends to provide the capability to delete a given
>>> message in a log compacted topic. Does such a use-case also require Kafka
>>> to keep the messages produced before the given message? If yes, then we
>>> can probably just use AdminClient.deleteRecords() or time-based log
>>> retention to meet the use-case requirement. If no, do you know what is
>>> the GDPR's requirement on time-to-deletion after a user explicitly
>>> requests the deletion (e.g. 1 hour, 1 day, 7 days)?
>>> 
>>> Thanks,
>>> Dong
>>> 
>>> On Mon, Aug 13, 2018 at 3:44 PM, xiongqi wu <xiongqiwu@gmail.com> wrote:
>>> 
>>>> Hi Eno,
>>>> 
>>>> The GDPR request we are getting here at LinkedIn is: if we get a request
>>>> to delete a record through a null key on a log compacted topic, we want
>>>> to delete the record via compaction in a given time period like 2 days
>>>> (whatever is required by the policy).
>>>> 
>>>> There might be other issues (such as orphan log segments under certain
>>>> conditions) that lead to GDPR problems, but they are more like something
>>>> we need to fix anyway regardless of GDPR.
>>>> 
>>>> -- Xiongqi (Wesley) Wu
>>>> 
>>>> On Mon, Aug 13, 2018 at 2:56 PM, Eno Thereska <eno.thereska@gmail.com> wrote:
>>>> 
>>>>> Hello,
>>>>> 
>>>>> Thanks for the KIP. I'd like to see a more precise definition of what
>>>>> part of GDPR you are targeting, as well as some sort of verification
>>>>> that this KIP actually addresses the problem. Right now I find this a
>>>>> bit vague:
>>>>> 
>>>>> "Ability to delete a log message through compaction in a timely manner
>>>>> has become an important requirement in some use cases (e.g., GDPR)"
>>>>> 
>>>>> Is there any guarantee that after this KIP the GDPR problem is solved,
>>>>> or do we need to do something else as well, e.g., more KIPs?
>>>>> 
>>>>> Thanks
>>>>> Eno
>>>>> 
>>>>> On Thu, Aug 9, 2018 at 4:18 PM, xiongqi wu <xiongqiwu@gmail.com> wrote:
>>>>> 
>>>>>> Hi Kafka,
>>>>>> 
>>>>>> This KIP tries to address the GDPR concern to fulfill deletion
>>>>>> requests on time through time-based log compaction on a
>>>>>> compaction-enabled topic:
>>>>>> 
>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-354%3A+Time-based+log+compaction+policy
>>>>>> 
>>>>>> Any feedback will be appreciated.
>>>>>> 
>>>>>> Xiongqi (Wesley) Wu
>>>>>> 
>>>>> 
>>>> 
>>> 
>> --
>> Xiongqi (Wesley) Wu
>> 
> --
> Brett Rann
> Senior DevOps Engineer
> Zendesk International Ltd
> 395 Collins Street, Melbourne VIC 3000 Australia
> Mobile: +61 (0) 418 826 017

--
-Regards,
Mayuresh R. Gharat
(862) 250-7125

Re: [DISCUSS] KIP-354 Time-based log compaction policy

Posted by xiongqi wu <xi...@gmail.com>.
Hi Dong,
I have updated the KIP to address your comments.
One correction to the previous email:
after an offline discussion with Dong, we decided to use MAX_LONG as the default
value for max.compaction.lag.ms.


Xiongqi (Wesley) Wu


On Mon, Oct 29, 2018 at 12:15 PM xiongqi wu <xi...@gmail.com> wrote:

> Hi Dong,
>
> Thank you for your comment.  See my inline comments.
> I will update the KIP shortly.
>
> Xiongqi (Wesley) Wu
>
>
> On Sun, Oct 28, 2018 at 9:17 PM Dong Lin <li...@gmail.com> wrote:
>
>> Hey Xiongqi,
>>
>> Sorry for late reply. I have some comments below:
>>
>> 1) As discussed earlier in the email list, if the topic is configured with
>> both deletion and compaction, in some cases messages produced a long time
>> ago can not be deleted based on time. This is a valid use-case because we
>> actually have topic which is configured with both deletion and compaction
>> policy. And we should enforce the semantics for both policy. Solution A
>> sounds good. We do not need interface change (e.g. extra config) to
>> enforce
>> solution A. All we need is to update implementation so that when broker
>> compacts a topic, if the message has timestamp (which is the common case),
>> messages that are too old (based on the time-based retention config) will
>> be discarded. Since this is a valid issue and it is also related to the
>> guarantee of when a message can be deleted, can we include the solution of
>> this problem in the KIP?
>>
> ======  This makes sense.  We can use a similar approach to increase the log
> start offset.
>
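Advancing the log start offset past fully expired segments, as suggested above, can be sketched like this (the segment layout is simplified to {baseOffset, largestTimestampMs} pairs; this is not the broker's actual code):

```java
// Sketch: a segment whose largest timestamp is older than the retention
// window is entirely expired, so the log start offset can move to the base
// offset of the next segment. Segments are modeled as
// {baseOffset, largestTimestampMs} pairs, ordered by base offset.
public class LogStartOffsetAdvance {

    public static long newLogStartOffset(long[][] segments, long currentStartOffset,
                                         long nowMs, long retentionMs) {
        long cutoff = nowMs - retentionMs;
        long start = currentStartOffset;
        // Never delete the active (last) segment, hence i + 1 < length.
        for (int i = 0; i + 1 < segments.length; i++) {
            if (segments[i][1] < cutoff) {
                start = segments[i + 1][0]; // whole segment expired, skip past it
            } else {
                break; // segments are time-ordered; nothing later is expired
            }
        }
        return Math.max(start, currentStartOffset);
    }
}
```

For example, with segments {0, 100}, {50, 200}, {90, 300} and a cutoff of 150, only the first segment is fully expired, so the start offset advances to 50.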
>>
>> 2) It is probably OK to assume that all messages have timestamp. The
>> per-message timestamp was introduced into Kafka 0.10.0 with KIP-31 and
>> KIP-32 as of Feb 2016. Kafka 0.10.0 or earlier versions are no longer
>> supported. Also, since the use-case for this feature is primarily for
>> GDPR,
>> we can assume that client library has already been upgraded to support
>> SSL,
>> a feature which was added after KIP-31 and KIP-32.
>>
>  =========>  Ok. We can use message timestamp to delete expired records
> if both compaction and retention are enabled.
>
>
> 3) In Proposed Change section 2.a, it is said that segment.largestTimestamp
>> - maxSegmentMs can be used to determine the timestamp of the earliest
>> message. Would it be simpler to just use the create time of the file to
>> determine the time?
>>
> ========>  Linux/Java doesn't provide an API for file creation time because
> some filesystem types don't provide it.
>
>
>> 4) The KIP suggests to use must-clean-ratio to select the partition to be
>> compacted. Unlike dirty ratio which is mostly for performance, the logs
>> whose "must-clean-ratio" is non-zero must be compacted immediately for
>> correctness reason (and for GDPR). And if this cannot be achieved because
>> e.g. broker compaction throughput is too low, investigation will be
>> needed.
>> So it seems simpler to first compact logs which have a segment whose earliest
>> timestamp is earlier than now - max.compaction.lag.ms, instead of defining
>> must-clean-ratio and sorting logs based on this value.
>>
>>
> ======>  Good suggestion. This can simplify the implementation quite a bit
> if we are not too concerned about compaction of a GDPR-required partition
> being queued behind some large partition.  The actual compaction completion
> time is not guaranteed anyway.
>
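The selection rule agreed on above — compact any log whose earliest estimated timestamp predates now - max.compaction.lag.ms — is straightforward to express (a sketch with illustrative names, not the cleaner's actual code):

```java
// Sketch of the compaction-due check. The earliest timestamp of the first
// un-compacted segment is estimated from the first record's timestamp when
// present, otherwise conservatively as largestTimestamp - maxSegmentMs
// (compacting a little early never violates the upper bound).
public class CompactionDue {

    public static long estimateEarliestTimestamp(Long firstRecordTimestampMs,
                                                 long largestTimestampMs,
                                                 long maxSegmentMs) {
        return firstRecordTimestampMs != null
                ? firstRecordTimestampMs
                : largestTimestampMs - maxSegmentMs;
    }

    // A log must be compacted once its earliest un-compacted record has been
    // sitting around longer than max.compaction.lag.ms.
    public static boolean mustCompact(long earliestTimestampMs, long nowMs,
                                      long maxCompactionLagMs) {
        return earliestTimestampMs <= nowMs - maxCompactionLagMs;
    }
}
```

Because the fallback estimate can only be earlier than the true first-record timestamp, the check can fire early but never late, which preserves the max-lag guarantee.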
>
>> 5) The KIP says max.compaction.lag.ms is 0 by default and it is also
>> suggested that 0 means disable. Should we set this value to MAX_LONG by
>> default to effectively disable the feature added in this KIP?
>>
>> ====> I would rather use 0 so the corresponding code path will not be
> exercised.  By using MAX_LONG, we would theoretically go through related
> code to find out whether the partition is required to be compacted to
> satisfy MAX_LONG.
>
> 6) It is probably cleaner and readable not to include in Public Interface
>> section those configs whose meaning is not changed.
>>
>> ====> I will clean that up.
>
> 7) The goal of this KIP is to ensure that log segment whose earliest
>> message is earlier than a given threshold will be compacted. This goal may
>> not be achieved if the compaction throughput cannot catch up with the total
>> bytes-in-rate for the compacted topics on the broker. Thus we need an easy
>> way to tell the operator whether this goal is achieved. If we don't already
>> have such a metric, maybe we can include metrics to show 1) the total number
>> of log segments (or logs) which needs to be immediately compacted as
>> determined by max.compaction.lag; and 2) the maximum value of now -
>> earliest_time_stamp_of_segment among all segments that needs to be
>> compacted.
>>
> =======> Good suggestion. I will update the KIP for these metrics.
>
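The two metrics proposed in point 7 reduce to simple aggregates over the per-segment earliest-timestamp estimates. A sketch (method names are illustrative, not the final JMX metric names):

```java
// Sketch of the two proposed metrics: (1) how many segments are overdue for
// compaction under max.compaction.lag.ms, and (2) the worst-case delay, i.e.
// the maximum of (now - earliest timestamp) over those overdue segments.
public class CompactionLagMetrics {

    public static int segmentsNeedingCompaction(long[] earliestTimestampsMs,
                                                long nowMs, long maxLagMs) {
        int overdue = 0;
        for (long ts : earliestTimestampsMs) {
            if (ts <= nowMs - maxLagMs) overdue++;
        }
        return overdue;
    }

    public static long maxCompactionDelayMs(long[] earliestTimestampsMs,
                                            long nowMs, long maxLagMs) {
        long maxDelay = 0L;
        for (long ts : earliestTimestampsMs) {
            if (ts <= nowMs - maxLagMs) {
                maxDelay = Math.max(maxDelay, nowMs - ts);
            }
        }
        return maxDelay; // 0 when nothing is overdue
    }
}
```

A sustained non-zero value of either metric would tell the operator that compaction throughput is not keeping up with the configured max lag.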
> 8) The Performance Impact suggests user to use the existing metrics to
>> monitor the performance impact of this KIP. It i useful to list mean of
>> each jmx metrics that we want user to monitor, and possibly explain how to
>> interpret the value of these metrics to determine whether there is
>> performance issue.
>>
>> =========>  I will update the KIP.
>
>> Thanks,
>> Dong
>>
>> On Tue, Oct 16, 2018 at 10:53 AM xiongqi wu <xi...@gmail.com> wrote:
>>
>> > Mayuresh,
>> >
>> > Thanks for the comments.
>> > The requirement is that we need to pick up segments that are older than
>> > maxCompactionLagMs for compaction.
>> > maxCompactionLagMs is an upper-bound, which implies that picking up
>> > segments for compaction earlier doesn't violate the policy.
>> > We use the creation time of a segment as an estimation of its records
>> > arrival time, so these records can be compacted no later than
>> > maxCompactionLagMs.
>> >
>> > On the other hand, compaction is an expensive operation; we don't want to
>> > compact the log partition whenever a new segment is sealed.
>> > Therefore, we want to pick up a segment for compaction when the segment is
>> > close to the mandatory max compaction lag (so we use segment creation time
>> > as an estimation).
>> >
>> >
>> > Xiongqi (Wesley) Wu
>> >
>> >
>> > On Mon, Oct 15, 2018 at 5:54 PM Mayuresh Gharat <
>> > gharatmayuresh15@gmail.com>
>> > wrote:
>> >
>> > > Hi Wesley,
>> > >
>> > > Thanks for the KIP and sorry for being late to the party.
>> > >  I wanted to understand the scenario you mentioned in Proposed
>> > > changes:
>> > >
>> > > > Estimate the earliest message timestamp of an un-compacted log segment.
>> > > > We only need to estimate the earliest message timestamp for un-compacted
>> > > > log segments to ensure timely compaction, because the deletion requests
>> > > > that belong to compacted segments have already been processed.
>> > > >
>> > > >    1. for the first (earliest) log segment: The estimated earliest
>> > > >    timestamp is set to the timestamp of the first message if a timestamp
>> > > >    is present in the message. Otherwise, the estimated earliest timestamp
>> > > >    is set to "segment.largestTimestamp - maxSegmentMs"
>> > > >    (segment.largestTimestamp is the lastModified time of the log segment
>> > > >    or the max timestamp we see for the log segment). In the latter case,
>> > > >    the actual timestamp of the first message might be later than the
>> > > >    estimation, but it is safe to pick up the log for compaction earlier.
>> > > >
>> > > When we say "actual timestamp of the first message might be later than the
>> > > estimation, but it is safe to pick up the log for compaction earlier.",
>> > > doesn't that violate the assumption that we will consider a segment for
>> > > compaction only if the time of creation of the segment has crossed
>> > > "now - maxCompactionLagMs"?
>> > >
>> > > Thanks,
>> > >
>> > > Mayuresh
>> > >
>> > > On Mon, Sep 3, 2018 at 7:28 PM Brett Rann <br...@zendesk.com.invalid>
>> > > wrote:
>> > >
>> > > > Might also be worth moving to a vote thread? Discussion seems to have
>> > > > gone as far as it can.
>> > > >
>> > > > > On 4 Sep 2018, at 12:08, xiongqi wu <xi...@gmail.com> wrote:
>> > > > >
>> > > > > Brett,
>> > > > >
>> > > > > Yes, I will post PR tomorrow.
>> > > > >
>> > > > > Xiongqi (Wesley) Wu
>> > > > >
>> > > > >
>> > > > > On Sun, Sep 2, 2018 at 6:28 PM Brett Rann <brann@zendesk.com.invalid> wrote:
>> > > > >
>> > > > > > +1 (non-binding) from me on the interface. I'd like to see someone
>> > > > > > familiar with the code comment on the approach, and note there's a
>> > > > > > couple of different approaches: what's documented in the KIP, and
>> > > > > > what Xiaohe Dong was working on here:
>> > > > > >
>> > > > > > https://github.com/dongxiaohe/kafka/tree/dongxiaohe/log-cleaner-compaction-max-lifetime-2.0
>> > > > > >
>> > > > > > If you have code working already Xiongqi Wu could you share a PR?
>> > > > > > I'd be happy to start testing.
>> > > > > >
>> > > > > > On Tue, Aug 28, 2018 at 5:57 AM xiongqi wu <xiongqiwu@gmail.com> wrote:
>> > > > > >
>> > > > > > > Hi All,
>> > > > > > >
>> > > > > > > Do you have any additional comments on this KIP?
>> > > > > > >
>> > > > > > > On Thu, Aug 16, 2018 at 9:17 PM, xiongqi wu <xiongqiwu@gmail.com> wrote:
>> > > > > > >
>> > > > > > > > on 2)
>> > > > > > > > The offset map is built starting from the first dirty segment.
>> > > > > > > > The compaction starts from the beginning of the log partition.
>> > > > > > > > That's how it ensures the deletion of tombstone keys.
>> > > > > > > > I will double check tomorrow.
>> > > > > > > >
>> > > > > > > > Xiongqi (Wesley) Wu
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > On Thu, Aug 16, 2018 at 6:46 PM Brett Rann <br...@zendesk.com.invalid> wrote:
>> > > > > > > >
>> > > > > > > >> To just clarify a bit on 1. whether there's an external storage/DB
>> > > > > > > >> isn't relevant here.
>> > > > > > > >> Compacted topics allow a tombstone record to be sent (a null value
>> > > > > > > >> for a key) which currently will result in old values for that key
>> > > > > > > >> being deleted if some conditions are met.
>> > > > > > > >> There are existing controls to make sure the old values will stay
>> > > > > > > >> around for a minimum time at least, but no dedicated control to
>> > > > > > > >> ensure the tombstone will delete within a maximum time.
>> > > > > > > >>
>> > > > > > > >> One popular reason that maximum time for deletion is desirable
>> > > > > > > >> right now is GDPR with PII. But we're not proposing any GDPR
>> > > > > > > >> awareness in kafka, just being able to guarantee a max time where
>> > > > > > > >> a tombstoned key will be removed from the compacted topic.
>> > > > > > > >>
>> > > > > > > >> on 2)
>> > > > > > > >> huh, i thought it kept track of the first dirty segment and didn't
>> > > > > > > >> recompact older "clean" ones.
>> > > > > > > >> But I didn't look at code or test for that.
>> > > > > > > >>
>> > > > > > > >> On Fri, Aug 17, 2018 at 10:57 AM xiongqi wu <xiongqiwu@gmail.com> wrote:
>> > > > > > > >>
>> > > > > > > >> > 1, Owner of data (in this sense, Kafka is not the owner of the
>> > > > > > > >> > data) should keep track of the lifecycle of the data in some
>> > > > > > > >> > external storage/DB. The owner determines when to delete the
>> > > > > > > >> > data and sends the delete request to Kafka. Kafka doesn't know
>> > > > > > > >> > about the content of the data but provides a means for deletion.
>> > > > > > > >> >
>> > > > > > > >> > 2, each time compaction runs, it will start from the first
>> > > > > > > >> > segment (no matter if it is compacted or not). The time
>> > > > > > > >> > estimation here is only used to determine whether we should run
>> > > > > > > >> > compaction on this log partition. So we only need to estimate
>> > > > > > > >> > uncompacted segments.
>> > > > > > > >> >
>> > > > > > > >> > On Thu, Aug 16, 2018 at 5:35 PM, Dong Lin <
>> > > lindong28@gmail.com>
>> > > > > > > wrote:
>> > > > > > > >> >
>> > > > > > > >> > > Hey Xiongqi,
>> > > > > > > >> > >
>> > > > > > > >> > > Thanks for the update. I have two questions for the
>> latest
>> > > > KIP.
>> > > > > > > >> > >
>> > > > > > > >> > > 1) The motivation section says that one use case is to
>> > > delete
>> > > > PII
>> > > > > > > >> > (Personal
>> > > > > > > >> > > Identifiable information) data within 7 days while
>> keeping
>> > > > non-PII
>> > > > > > > >> > > indefinitely in compacted format. I suppose the
>> use-case
>> > > > depends
>> > > > > > on
>> > > > > > > >> the
>> > > > > > > >> > > application to determine when to delete those PII data.
>> > > Could
>> > > > you
>> > > > > > > >> explain
>> > > > > > > >> > > how can the application reliably determine the set of keys
>> > that
>> > > > should
>> > > > > > > be
>> > > > > > > >> > > deleted? Is the application required to always read
>> > > > > > > >> > > messages from the topic after
>> > > > > > > >> > > every restart and determine the keys to be deleted by
>> > > looking
>> > > > at
>> > > > > > > >> message
>> > > > > > > >> > > timestamp, or is application supposed to persist the
>> key->
>> > > > > > timestamp
>> > > > > > > >> > > information in a separate persistent storage system?
>> > > > > > > >> > >
>> > > > > > > >> > > 2) It is mentioned in the KIP that "we only need to
>> > estimate
>> > > > > > > earliest
>> > > > > > > >> > > message timestamp for un-compacted log segments because
>> > the
>> > > > > > deletion
>> > > > > > > >> > > requests that belong to compacted segments have already
>> > been
>> > > > > > > >> processed".
>> > > > > > > >> > > Not sure if it is correct. If a segment is compacted
>> > before
>> > > > user
>> > > > > > > sends
>> > > > > > > >> > > message to delete a key in this segment, it seems that
>> we
>> > > > still
>> > > > > > need
>> > > > > > > >> to
>> > > > > > > >> > > ensure that the segment will be compacted again within
>> the
>> > > > given
>> > > > > > > time
>> > > > > > > >> > after
>> > > > > > > >> > > the deletion is requested, right?
>> > > > > > > >> > >
>> > > > > > > >> > > Thanks,
>> > > > > > > >> > > Dong
>> > > > > > > >> > >
>> > > > > > > >> > > On Thu, Aug 16, 2018 at 10:27 AM, xiongqi wu <
>> > > > xiongqiwu@gmail.com
>> > > > > > >
>> > > > > > > >> > wrote:
>> > > > > > > >> > >
>> > > > > > > >> > > > Hi Xiaohe,
>> > > > > > > >> > > >
>> > > > > > > >> > > > Quick note:
>> > > > > > > >> > > > 1) Use minimum of segment.ms and
>> max.compaction.lag.ms
>> > > > > > > >> > > >
>> > > > > > > >> > > > 2) I am not sure if I get your second question. First,
>> > > > > > > >> > > > we have jitter when we roll the active segment. Second,
>> > > > > > > >> > > > on each compaction, we compact up to what the offset
>> > > > > > > >> > > > map allows. Those will not lead to a perfect compaction
>> > > > > > > >> > > > storm over time. In addition, I expect we are setting
>> > > > > > > >> > > > max.compaction.lag.ms on the order of days.
>> > > > > > > >> > > >
>> > > > > > > >> > > > 3) I don't have access to the confluent community
>> slack
>> > > for
>> > > > > > now. I
>> > > > > > > >> am
>> > > > > > > >> > > > reachable via Google Hangouts.
>> > > > > > > >> > > > To avoid the double effort, here is my plan:
>> > > > > > > >> > > > a) Collect more feedback and feature requirements on
>> the
>> > > KIP.
>> > > > > > > >> > > > b) Wait until this KIP is approved.
>> > > > > > > >> > > > c) I will address any additional requirements in the
>> > > > > > > implementation.
>> > > > > > > >> > (My
>> > > > > > > >> > > > current implementation only complies with whatever is
>> > > > > > > >> > > > described in the KIP now.)
>> > > > > > > >> > > > d) I can share the code with you and the community to
>> > > > > > > >> > > > see if you want to add anything.
>> > > > > > > >> > > > e) submission through committee
>> > > > > > > >> > > >
>> > > > > > > >> > > >
>> > > > > > > >> > > > On Wed, Aug 15, 2018 at 11:42 PM, XIAOHE DONG <
>> > > > > > > >> dannyrivclo@gmail.com>
>> > > > > > > >> > > > wrote:
>> > > > > > > >> > > >
>> > > > > > > >> > > > > Hi Xiongqi
>> > > > > > > >> > > > >
>> > > > > > > >> > > > > Thanks for thinking about implementing this as
>> well.
>> > :)
>> > > > > > > >> > > > >
>> > > > > > > >> > > > > I was thinking about using `segment.ms` to trigger
>> > the
>> > > > > > segment
>> > > > > > > >> roll.
>> > > > > > > >> > > > > Also, its value can be the largest time bias for
>> the
>> > > > record
>> > > > > > > >> deletion.
>> > > > > > > >> > > For
>> > > > > > > >> > > > > example, if the `segment.ms` is 1 day and `
>> > > > max.compaction.ms`
>> > > > > > > is
>> > > > > > > >> 30
>> > > > > > > >> > > > days,
>> > > > > > > >> > > > > the compaction may happen around 31 days.
>> > > > > > > >> > > > >
>> > > > > > > >> > > > > For my curiosity, is there a way we can do some
>> > > > performance
>> > > > > > test
>> > > > > > > >> for
>> > > > > > > >> > > this
>> > > > > > > >> > > > > and any tools you can recommend. As you know,
>> > > previously,
>> > > > it
>> > > > > > is
>> > > > > > > >> > cleaned
>> > > > > > > >> > > > up
>> > > > > > > >> > > > > by respecting dirty ratio, but now it may happen
>> > anytime
>> > > > if
>> > > > > > max
>> > > > > > > >> lag
>> > > > > > > >> > has
>> > > > > > > >> > > > > passed for each message. I wonder what would
>> happen if
>> > > > clients
>> > > > > > > >> send
>> > > > > > > >> > > huge
>> > > > > > > >> > > > > amount of tombstone records at the same time.
>> > > > > > > >> > > > >
>> > > > > > > >> > > > > I am looking forward to have a quick chat with you
>> to
>> > > > avoid
>> > > > > > > double
>> > > > > > > >> > > effort
>> > > > > > > >> > > > > on this. I am in confluent community slack during
>> the
>> > > work
>> > > > > > time.
>> > > > > > > >> My
>> > > > > > > >> > > name
>> > > > > > > >> > > > is
>> > > > > > > >> > > > > Xiaohe Dong. :)
>> > > > > > > >> > > > >
>> > > > > > > >> > > > > Rgds
>> > > > > > > >> > > > > Xiaohe Dong
>> > > > > > > >> > > > >
>> > > > > > > >> > > > >
>> > > > > > > >> > > > >
>> > > > > > > >> > > > > On 2018/08/16 01:22:22, xiongqi wu <
>> > xiongqiwu@gmail.com
>> > > >
>> > > > > > wrote:
>> > > > > > > >> > > > > > Brett,
>> > > > > > > >> > > > > >
>> > > > > > > >> > > > > > Thank you for your comments.
>> > > > > > > >> > > > > > Since we already have an immediate compaction
>> > > > > > > >> > > > > > setting (setting the min dirty ratio to 0), I
>> > > > > > > >> > > > > > decided to use "0" as the disabled state.
>> > > > > > > >> > > > > > I am OK to go with the -1 (disable), 0 (immediate)
>> > > > > > > >> > > > > > options.
>> > > > > > > >> > > > > >
>> > > > > > > >> > > > > > For the implementation, there are a few
>> differences
>> > > > between
>> > > > > > > mine
>> > > > > > > >> > and
>> > > > > > > >> > > > > > "Xiaohe Dong"'s :
>> > > > > > > >> > > > > > 1) I used the estimated creation time of a log
>> > > > > > > >> > > > > > segment, instead of the largest timestamp of a log,
>> > > > > > > >> > > > > > to determine compaction eligibility, because a log
>> > > > > > > >> > > > > > segment might stay as an active segment for up to
>> > > > > > > >> > > > > > the "max compaction lag" (see the KIP for details).
>> > > > > > > >> > > > > > 2) I measure how many bytes we must clean to
>> > > > follow the
>> > > > > > > >> "max
>> > > > > > > >> > > > > > compaction lag" rule, and use that to determine
>> the
>> > > > order of
>> > > > > > > >> > > > compaction.
>> > > > > > > >> > > > > > 3) force the active segment to roll to follow the
>> "max
>> > > > > > compaction
>> > > > > > > >> lag"
>> > > > > > > >> > > > > >
>> > > > > > > >> > > > > > I can share my code so we can coordinate.
>> > > > > > > >> > > > > >
>> > > > > > > >> > > > > > I haven't thought about a new API to force a
>> > compaction.
>> > > > what
>> > > > > > is
>> > > > > > > >> the
>> > > > > > > >> > > use
>> > > > > > > >> > > > > case
>> > > > > > > >> > > > > > for this one?
>> > > > > > > >> > > > > >
>> > > > > > > >> > > > > >
>> > > > > > > >> > > > > > On Wed, Aug 15, 2018 at 5:33 PM, Brett Rann
>> > > > > > > >> > > <brann@zendesk.com.invalid
>> > > > > > > >> > > > >
>> > > > > > > >> > > > > > wrote:
>> > > > > > > >> > > > > >
>> > > > > > > >> > > > > > > We've been looking into this too.
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > > Mailing list:
>> > > > > > > >> > > > > > > https://lists.apache.org/thread.html/
>> > > > > > > >> > > ed7f6a6589f94e8c2a705553f364ef
>> > > > > > > >> > > > > > > 599cb6915e4c3ba9b561e610e4@%
>> > 3Cdev.kafka.apache.org
>> > > %3E
>> > > > > > > >> > > > > > > jira wish:
>> > > > > > https://issues.apache.org/jira/browse/KAFKA-7137
>> > > > > > > >> > > > > > > confluent slack discussion:
>> > > > > > > >> > > > > > >
>> > > > https://confluentcommunity.slack.com/archives/C49R61XMM/
>> > > > > > > >> > > > > p1530760121000039
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > > A person on my team has started on code so you
>> > might
>> > > > want
>> > > > > > to
>> > > > > > > >> > > > > coordinate:
>> > > > > > > >> > > > > > >
>> > > > https://github.com/dongxiaohe/kafka/tree/dongxiaohe/log-
>> > > > > > > >> > > > > > > cleaner-compaction-max-lifetime-2.0
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > > He's been working with Jason Gustafson and
>> James
>> > > Chen
>> > > > > > around
>> > > > > > > >> the
>> > > > > > > >> > > > > changes.
>> > > > > > > >> > > > > > > You can ping him on confluent slack as Xiaohe
>> > Dong.
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > > It's great to know others are thinking on it as
>> > > well.
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > > You've added the requirement to force a segment
>> > roll
>> > > > which
>> > > > > > > we
>> > > > > > > >> > > hadn't
>> > > > > > > >> > > > > gotten
>> > > > > > > >> > > > > > > to yet, which is great. I was content with it
>> not
>> > > > > > including
>> > > > > > > >> the
>> > > > > > > >> > > > active
>> > > > > > > >> > > > > > > segment.
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > > > Adding topic level configuration "
>> > > > max.compaction.lag.ms
>> > > > > > ",
>> > > > > > > >> and
>> > > > > > > >> > > > > > > corresponding broker configuration "
>> > > > > > > >> > log.cleaner.max.compaction.lag.ms
>> > > > > > > >> > > > > ",
>> > > > > > > >> > > > > > > which is set to 0 (disabled) by default.
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > > Glancing at some other settings, the convention
>> > > > > > > >> > > > > > > seems to me to be -1 for
>> > > > > > > >> > > > > > > disabled (or infinite, which is more meaningful
>> > > > here). 0
>> > > > > > to
>> > > > > > > me
>> > > > > > > >> > > > implies
>> > > > > > > >> > > > > > > instant, a little quicker than 1.
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > > We've been trying to think about a way to
>> trigger
>> > > > > > compaction
>> > > > > > > >> as
>> > > > > > > >> > > well
>> > > > > > > >> > > > > > > through an API call, which would need to be
>> > flagged
>> > > > > > > somewhere
>> > > > > > > >> (ZK
>> > > > > > > >> > > > > admin/
>> > > > > > > >> > > > > > > space?) but we're struggling to think how that
>> > would
>> > > > be
>> > > > > > > >> > coordinated
>> > > > > > > >> > > > > across
>> > > > > > > >> > > > > > > brokers and partitions. Have you given any
>> thought
>> > > to
>> > > > > > that?
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > > On Thu, Aug 16, 2018 at 8:44 AM xiongqi wu <
>> > > > > > > >> xiongqiwu@gmail.com>
>> > > > > > > >> > > > > wrote:
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > > > Eno, Dong,
>> > > > > > > >> > > > > > > >
>> > > > > > > >> > > > > > > > I have updated the KIP. We decided not to
>> address
>> > > the
>> > > > > > issue
>> > > > > > > >> that
>> > > > > > > >> > > we
>> > > > > > > >> > > > > might
>> > > > > > > >> > > > > > > > have for both compaction and time retention
>> > > enabled
>> > > > > > topics
>> > > > > > > >> (see
>> > > > > > > >> > > the
>> > > > > > > >> > > > > > > > rejected alternative item 2). This KIP will
>> only
>> > > > ensure
>> > > > > > > log
>> > > > > > > >> can
>> > > > > > > >> > > be
>> > > > > > > >> > > > > > > > compacted after a specified time-interval.
>> > > > > > > >> > > > > > > >
>> > > > > > > >> > > > > > > > As suggested by Dong, we will also enforce "
>> > > > > > > >> > > max.compaction.lag.ms"
>> > > > > > > >> > > > > is
>> > > > > > > >> > > > > > > not
>> > > > > > > >> > > > > > > > less than "min.compaction.lag.ms".
>> > > > > > > >> > > > > > > >
>> > > > > > > >> > > > > > > >
>> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-354
>> > > > > > > >> > > > > Time-based
>> > > > > > > >> > > > > > > log
>> > > > > > > >> > > > > > > > compaction policy
>> > > > > > > >> > > > > > > > <
>> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-354
>> > > > > > > >> > > > > Time-based
>> > > > > > > >> > > > > > > log compaction policy>
>> > > > > > > >> > > > > > > >
>> > > > > > > >> > > > > > > >
>> > > > > > > >> > > > > > > > On Tue, Aug 14, 2018 at 5:01 PM, xiongqi wu <
>> > > > > > > >> > xiongqiwu@gmail.com
>> > > > > > > >> > > >
>> > > > > > > >> > > > > wrote:
>> > > > > > > >> > > > > > > >
>> > > > > > > >> > > > > > > > >
>> > > > > > > >> > > > > > > > > Per discussion with Dong, he made a very
>> good
>> > > > point
>> > > > > > that
>> > > > > > > >> if
>> > > > > > > >> > > > > compaction
>> > > > > > > >> > > > > > > > > and time based retention are both enabled
>> on a
>> > > > topic,
>> > > > > > > the
>> > > > > > > >> > > > > compaction
>> > > > > > > >> > > > > > > > might
>> > > > > > > >> > > > > > > > > prevent records from being deleted on time.
>> > The
>> > > > reason
>> > > > > > > is
>> > > > > > > >> > when
>> > > > > > > >> > > > > > > compacting
>> > > > > > > >> > > > > > > > > multiple segments into one single segment,
>> the
>> > > > newly
>> > > > > > > >> created
>> > > > > > > >> > > > > segment
>> > > > > > > >> > > > > > > will
>> > > > > > > >> > > > > > > > > have same lastmodified timestamp as latest
>> > > > original
>> > > > > > > >> segment.
>> > > > > > > >> > We
>> > > > > > > >> > > > > lose
>> > > > > > > >> > > > > > > the
>> > > > > > > >> > > > > > > > > timestamp of all original segments except
>> the
>> > > last
>> > > > > > one.
>> > > > > > > >> As a
>> > > > > > > >> > > > > result,
>> > > > > > > >> > > > > > > > > records might not be deleted as they should
>> be
>> > > > through
>> > > > > > > time
>> > > > > > > >> > based
>> > > > > > > >> > > > > > > > retention.
>> > > > > > > >> > > > > > > > >
>> > > > > > > >> > > > > > > > > With the current KIP proposal, if we want
>> to
>> > > > ensure
>> > > > > > > timely
>> > > > > > > >> > > > > deletion, we
>> > > > > > > >> > > > > > > > > have the following configurations:
>> > > > > > > >> > > > > > > > > 1) enable time based log compaction only :
>> > > > deletion is
>> > > > > > > >> done
>> > > > through
>> > > > > > > >> > > > > > > > > overriding the same key
>> > > > > > > >> > > > > > > > > 2) enable time based log retention only:
>> > > deletion
>> > > > is
>> > > > > > > done
>> > > > > > > >> > > though
>> > > > > > > >> > > > > > > > > time-based retention
>> > > > > > > >> > > > > > > > > 3) enable both log compaction and time
>> based
>> > > > > > retention:
>> > > > > > > >> > > Deletion
>> > > > > > > >> > > > > is not
>> > > > > > > >> > > > > > > > > guaranteed.
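
The segment-merge problem behind case 3 can be shown with a toy model (illustrative Python, not broker code; timestamps are made up):

```python
# Illustration of the interaction described in case 3 above (toy model,
# not broker code): merging segments during compaction keeps only the
# newest segment's lastModified time, so time-based retention can miss
# records that are actually old.

def merge(segments):
    """Compaction merges segments; the merged segment inherits the
    lastModified timestamp of the latest input segment."""
    records = [r for seg in segments for r in seg["records"]]
    return {"records": records,
            "last_modified": max(s["last_modified"] for s in segments)}

old = {"records": ["old-record"], "last_modified": 1_000}
new = {"records": ["new-record"], "last_modified": 9_000}
merged = merge([old, new])

# Retention by segment lastModified: delete segments older than now - 5_000.
now = 10_000
print(merged["last_modified"])                # 9000
print(merged["last_modified"] < now - 5_000)  # False: "old-record" survives
```

Before the merge, the old segment alone would have been eligible for time-based deletion; after the merge, its records ride along under the newer timestamp.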
>> > > > > > > >> > > > > > > > >
>> > > > > > > >> > > > > > > > > Not sure if we have use case 3 and also
>> want
>> > > > deletion
>> > > > > > to
>> > > > > > > >> > happen
>> > > > > > > >> > > > on
>> > > > > > > >> > > > > > > time.
>> > > > > > > >> > > > > > > > > There are several options to address
>> deletion
>> > > > issue
>> > > > > > when
>> > > > > > > >> > enable
>> > > > > > > >> > > > > both
>> > > > > > > >> > > > > > > > > compaction and retention:
>> > > > > > > >> > > > > > > > > A) During log compaction, looking into
>> record
>> > > > > > timestamp
>> > > > > > > to
>> > > > > > > >> > > delete
>> > > > > > > >> > > > > > > expired
>> > > > > > > >> > > > > > > > > records. This can be done in compaction
>> logic
>> > > > itself
>> > > > > > or
>> > > > > > > >> use
>> > > > > > > >> > > > > > > > > AdminClient.deleteRecords() . But this
>> assumes
>> > > we
>> > > > have
>> > > > > > > >> record
>> > > > > > > >> > > > > > > timestamp.
>> > > > > > > >> > > > > > > > > B) retain the lastModifed time of original
>> > > > segments
>> > > > > > > during
>> > > > > > > >> > log
>> > > > > > > >> > > > > > > > compaction.
>> > > > > > > >> > > > > > > > > This requires extra meta data to record the
>> > > > > > information
>> > > > > > > or
>> > > > > > > >> > not
>> > > > > > > >> > > > > grouping
>> > > > > > > >> > > > > > > > > multiple segments into one during
>> compaction.
>> > > > > > > >> > > > > > > > >
>> > > > > > > >> > > > > > > > > If we have use case 3 in general, I would
>> > prefer
>> > > > > > > solution
>> > > > > > > >> A
>> > > > > > > >> > and
>> > > > > > > >> > > > > rely on
>> > > > > > > >> > > > > > > > > record timestamp.
>> > > > > > > >> > > > > > > > >
>> > > > > > > >> > > > > > > > >
>> > > > > > > >> > > > > > > > > Two questions:
>> > > > > > > >> > > > > > > > > Do we have use case 3? Is it nice to have
>> or
>> > > must
>> > > > > > have?
>> > > > > > > >> > > > > > > > > If we have use case 3 and want to go with
>> > > > solution A,
>> > > > > > > >> should
>> > > > > > > >> > we
>> > > > > > > >> > > > > > > introduce
>> > > > > > > >> > > > > > > > > a new configuration to enforce deletion by
>> > > > timestamp?
>> > > > > > > >> > > > > > > > >
>> > > > > > > >> > > > > > > > >
>> > > > > > > >> > > > > > > > > On Tue, Aug 14, 2018 at 1:52 PM, xiongqi
>> wu <
>> > > > > > > >> > > xiongqiwu@gmail.com
>> > > > > > > >> > > > >
>> > > > > > > >> > > > > > > wrote:
>> > > > > > > >> > > > > > > > >
>> > > > > > > >> > > > > > > > >> Dong,
>> > > > > > > >> > > > > > > > >>
>> > > > > > > >> > > > > > > > >> Thanks for the comment.
>> > > > > > > >> > > > > > > > >>
>> > > > > > > >> > > > > > > > >> There are two retention policy: log
>> > compaction
>> > > > and
>> > > > > > time
>> > > > > > > >> > based
>> > > > > > > >> > > > > > > retention.
>> > > > > > > >> > > > > > > > >>
>> > > > > > > >> > > > > > > > >> Log compaction:
>> > > > > > > >> > > > > > > > >>
>> > > > > > > >> > > > > > > > >> we have use cases to keep infinite
>> retention
>> > > of a
>> > > > > > topic
>> > > > > > > >> > (only
>> > > > > > > >> > > > > > > > >> compaction). GDPR cares about deletion of
>> PII
>> > > > > > (personal
>> > > > > > > >> > > > > identifiable
>> > > > > > > >> > > > > > > > >> information) data.
>> > > > > > > >> > > > > > > > >> Since Kafka doesn't know what records
>> contain
>> > > > PII, it
>> > > > > > > >> relies
>> > > > > > > >> > > on
>> > > > > > > >> > > > > upper
>> > > > > > > >> > > > > > > > >> layer to delete those records.
>> > > > > > > >> > > > > > > > >> For those infinite retention use cases,
>> kafka
>> > > > needs
>> > > > > > to
>> > > > > > > >> > > provide a
>> > > > > > > >> > > > > way
>> > > > > > > >> > > > > > > to
>> > > > > > > >> > > > > > > > >> enforce compaction on time. This is what
>> we
>> > try
>> > > > to
>> > > > > > > >> address
>> > > > > > > >> > in
>> > > > > > > >> > > > this
>> > > > > > > >> > > > > > > KIP.
>> > > > > > > >> > > > > > > > >>
>> > > > > > > >> > > > > > > > >> Time based retention,
>> > > > > > > >> > > > > > > > >>
>> > > > > > > >> > > > > > > > >> There are also use cases that users of
>> Kafka
>> > > > might
>> > > > > > want
>> > > > > > > >> to
>> > > > > > > >> > > > expire
>> > > > > > > >> > > > > all
>> > > > > > > >> > > > > > > > >> their data.
>> > > > > > > >> > > > > > > > >> In those cases, they can use time based
>> > > > retention of
>> > > > > > > >> their
>> > > > > > > >> > > > topics.
>> > > > > > > >> > > > > > > > >>
>> > > > > > > >> > > > > > > > >>
>> > > > > > > >> > > > > > > > >> Regarding your first question, if a user
>> > wants
>> > > to
>> > > > > > > delete
>> > > > > > > >> a
>> > > > > > > >> > key
>> > > > > > > >> > > > in
>> > > > > > > >> > > > > the
>> > > > > > > >> > > > > > > > >> log compaction topic, the user has to
>> send a
>> > > > deletion
>> > > > > > > >> using
>> > > > > > > >> > > the
>> > > > > > > >> > > > > same
>> > > > > > > >> > > > > > > > key.
>> > > > > > > >> > > > > > > > >> Kafka only makes sure the deletion will
>> > happen
>> > > > under
>> > > > > > a
>> > > > > > > >> > certain
>> > > > > > > >> > > > > time
>> > > > > > > >> > > > > > > > >> periods (like 2 days/7 days).
>> > > > > > > >> > > > > > > > >>
>> > > > > > > >> > > > > > > > >> Regarding your second question. In most
>> > cases,
>> > > we
>> > > > > > might
>> > > > > > > >> want
>> > > > > > > >> > > to
>> > > > > > > >> > > > > delete
>> > > > > > > >> > > > > > > > >> all duplicated keys at the same time.
>> > > > > > > >> > > > > > > > >> Compaction might be more efficient since
>> we
>> > > need
>> > > > to
>> > > > > > > scan
>> > > > > > > >> the
>> > > > > > > >> > > log
>> > > > > > > >> > > > > and
>> > > > > > > >> > > > > > > > find
>> > > > > > > >> > > > > > > > >> all duplicates. However, the expected use
>> > case
>> > > > is to
>> > > > > > > set
>> > > > > > > >> the
>> > > > > > > >> > > > time
>> > > > > > > >> > > > > > > based
>> > > > > > > >> > > > > > > > >> compaction interval on the order of days,
>> and
>> > > be
>> > > > > > larger
>> > > > > > > >> than
>> > > > > > > >> > > > 'min
>> > > > > > > >> > > > > > > > >> compaction lag". We don't want log
>> compaction
>> > > to
>> > > > > > happen
>> > > > > > > >> > > > frequently
>> > > > > > > >> > > > > > > since
>> > > > > > > >> > > > > > > > >> it is expensive. The purpose is to help
>> low
>> > > > > > production
>> > > > > > > >> rate
>> > > > > > > >> > > > topic
>> > > > > > > >> > > > > to
>> > > > > > > >> > > > > > > get
>> > > > > > > >> > > > > > > > >> compacted on time. For the topic with
>> > "normal"
>> > > > > > incoming
>> > > > > > > >> > > message
>> > > > > > > >> > > > > > > > >> rate, the "min dirty ratio" might have
>> > > triggered
>> > > > the
>> > > > > > > >> > > compaction
>> > > > > > > >> > > > > before
>> > > > > > > >> > > > > > > > this
>> > > > > > > >> > > > > > > > >> time based compaction policy takes effect.
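
The interplay of the two triggers described above can be sketched as follows (hypothetical helper and default values, not actual broker code):

```python
# Sketch of the two cleaning triggers discussed above (hypothetical
# helper, not actual broker code): the existing dirty-ratio trigger plus
# the proposed max.compaction.lag.ms time trigger for low-traffic topics.

def should_clean(dirty_bytes, clean_bytes, earliest_dirty_ts_ms, now_ms,
                 min_dirty_ratio=0.5, max_compaction_lag_ms=2 * 86_400_000):
    total = dirty_bytes + clean_bytes
    dirty_ratio = dirty_bytes / total if total else 0.0
    overdue = now_ms - earliest_dirty_ts_ms > max_compaction_lag_ms
    return dirty_ratio >= min_dirty_ratio or overdue

day_ms = 86_400_000
# High-traffic topic: the dirty ratio fires long before the time limit.
print(should_clean(600, 400, earliest_dirty_ts_ms=0, now_ms=day_ms))
# Low-traffic topic: tiny dirty ratio, but 3 days old -> time trigger fires.
print(should_clean(10, 990, earliest_dirty_ts_ms=0, now_ms=3 * day_ms))
```

This matches the expectation above: busy topics are compacted by the dirty-ratio rule before the time bound matters, while the time bound catches slow topics.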
>> > > > > > > >> > > > > > > > >>
>> > > > > > > >> > > > > > > > >>
>> > > > > > > >> > > > > > > > >> Eno,
>> > > > > > > >> > > > > > > > >>
>> > > > > > > >> > > > > > > > >> For your question, like I mentioned we
>> have
>> > > long
>> > > > time
>> > > > > > > >> > > retention
>> > > > > > > >> > > > > use
>> > > > > > > >> > > > > > > case
>> > > > > > > >> > > > > > > > >> for log compacted topic, but we want to
>> > provide
>> > > > > > ability
>> > > > > > > >> to
>> > > > > > > >> > > > delete
>> > > > > > > >> > > > > > > > certain
>> > > > > > > >> > > > > > > > >> PII records on time.
>> > > > > > > >> > > > > > > > >> Kafka itself doesn't know whether a record
>> > > > contains
>> > > > > > > >> > sensitive
>> > > > > > > >> > > > > > > > information
>> > > > > > > >> > > > > > > > >> and relies on the user for deletion.
>> > > > > > > >> > > > > > > > >>
>> > > > > > > >> > > > > > > > >>
>> > > > > > > >> > > > > > > > >> On Mon, Aug 13, 2018 at 6:58 PM, Dong Lin
>> <
>> > > > > > > >> > > lindong28@gmail.com>
>> > > > > > > >> > > > > > > wrote:
>> > > > > > > >> > > > > > > > >>
>> > > > > > > >> > > > > > > > >>> Hey Xiongqi,
>> > > > > > > >> > > > > > > > >>>
>> > > > > > > >> > > > > > > > >>> Thanks for the KIP. I have two questions
>> > > > regarding
>> > > > > > the
>> > > > > > > >> > > use-case
>> > > > > > > >> > > > > for
>> > > > > > > >> > > > > > > > >>> meeting
>> > > > > > > >> > > > > > > > >>> GDPR requirement.
>> > > > > > > >> > > > > > > > >>>
>> > > > > > > >> > > > > > > > >>> 1) If I recall correctly, one of the GDPR
>> > > > > > requirement
>> > > > > > > is
>> > > > > > > >> > that
>> > > > > > > >> > > > we
>> > > > > > > >> > > > > can
>> > > > > > > >> > > > > > > > not
>> > > > > > > >> > > > > > > > >>> keep messages longer than e.g. 30 days in
>> > > > storage
>> > > > > > > (e.g.
>> > > > > > > >> > > Kafka).
>> > > > > > > >> > > > > Say
>> > > > > > > >> > > > > > > > there
>> > > > > > > >> > > > > > > > >>> exists a partition p0 which contains
>> > message1
>> > > > with
>> > > > > > > key1
>> > > > > > > >> and
>> > > > > > > >> > > > > message2
>> > > > > > > >> > > > > > > > with
>> > > > > > > >> > > > > > > > >>> key2. And then user keeps producing
>> messages
>> > > > with
>> > > > > > > >> key=key2
>> > > > > > > >> > to
>> > > > > > > >> > > > > this
>> > > > > > > >> > > > > > > > >>> partition. Since message1 with key1 is
>> never
>> > > > > > > overridden,
>> > > > > > > >> > > sooner
>> > > > > > > >> > > > > or
>> > > > > > > >> > > > > > > > later
>> > > > > > > >> > > > > > > > >>> we
>> > > > > > > >> > > > > > > > >>> will want to delete message1 and keep the
>> > > latest
>> > > > > > > message
>> > > > > > > >> > with
>> > > > > > > >> > > > > > > key=key2.
>> > > > > > > >> > > > > > > > >>> But
>> > > > > > > >> > > > > > > > >>> currently it looks like the log compaction
>> logic in
>> > > > Kafka
>> > > > > > > will
>> > > > > > > >> > > always
>> > > > > > > >> > > > > put
>> > > > > > > >> > > > > > > > these
>> > > > > > > >> > > > > > > > >>> messages in the same segment. Will this
>> be
>> > an
>> > > > issue?
>> > > > > > > >> > > > > > > > >>>
>> > > > > > > >> > > > > > > > >>> 2) The current KIP intends to provide the
>> > > > capability
>> > > > > > > to
>> > > > > > > >> > > delete
>> > > > > > > >> > > > a
>> > > > > > > >> > > > > > > given
>> > > > > > > >> > > > > > > > >>> message in log compacted topic. Does such
>> > > > use-case
>> > > > > > > also
>> > > > > > > >> > > require
>> > > > > > > >> > > > > Kafka
>> > > > > > > >> > > > > > > > to
>> > > > > > > >> > > > > > > > >>> keep the messages produced before the
>> given
>> > > > message?
>> > > > > > > If
>> > > > > > > >> > yes,
>> > > > > > > >> > > > > then we
>> > > > > > > >> > > > > > > > can
>> > > > > > > >> > > > > > > > >>> probably just use
>> > AdminClient.deleteRecords()
>> > > or
>> > > > > > > >> time-based
>> > > > > > > >> > > log
>> > > > > > > >> > > > > > > > retention
>> > > > > > > >> > > > > > > > >>> to meet the use-case requirement. If no,
>> do
>> > > you
>> > > > know
>> > > > > > > >> what
>> > > > > > > >> > is
>> > > > > > > >> > > > the
>> > > > > > > >> > > > > > > GDPR's
>> > > > > > > >> > > > > > > > >>> requirement on time-to-deletion after
>> user
>> > > > > > explicitly
>> > > > > > > >> > > requests
>> > > > > > > >> > > > > the
>> > > > > > > >> > > > > > > > >>> deletion
>> > > > > > > >> > > > > > > > >>> (e.g. 1 hour, 1 day, 7 day)?
>> > > > > > > >> > > > > > > > >>>
>> > > > > > > >> > > > > > > > >>> Thanks,
>> > > > > > > >> > > > > > > > >>> Dong
>> > > > > > > >> > > > > > > > >>>
>> > > > > > > >> > > > > > > > >>>
>> > > > > > > >> > > > > > > > >>> On Mon, Aug 13, 2018 at 3:44 PM, xiongqi
>> wu
>> > <
>> > > > > > > >> > > > xiongqiwu@gmail.com
>> > > > > > > >> > > > > >
>> > > > > > > >> > > > > > > > wrote:
>> > > > > > > >> > > > > > > > >>>
>> > > > > > > >> > > > > > > > >>> > Hi Eno,
>> > > > > > > >> > > > > > > > >>> >
>> > > > > > > >> > > > > > > > >>> > The GDPR request we are getting here at
>> > > > linkedin
>> > > > > > is
>> > > > > > > >> if we
>> > > > > > > >> > > > get a
>> > > > > > > >> > > > > > > > >>> request to
>> > > > > > > >> > > > > > > > >>> > delete a record through a null key on a
>> > log
>> > > > > > > compacted
>> > > > > > > >> > > topic,
>> > > > > > > >> > > > > > > > >>> > we want to delete the record via
>> > compaction
>> > > > in a
>> > > > > > > given
>> > > > > > > >> > time
>> > > > > > > >> > > > > period
>> > > > > > > >> > > > > > > > >>> like 2
>> > > > > > > >> > > > > > > > >>> > days (whatever is required by the
>> policy).
>> > > > > > > >> > > > > > > > >>> >
>> > > > > > > >> > > > > > > > >>> > There might be other issues (such as
>> > orphan
>> > > > log
>> > > > > > > >> segments
>> > > > > > > >> > > > under
>> > > > > > > >> > > > > > > > certain
>> > > > > > > >> > > > > > > > >>> > conditions) that lead to GDPR problem
>> but
>> > > > they are
>> > > > > > > >> more
>> > > > > > > >> > > like
>> > > > > > > >> > > > > > > > >>> something we
>> > > > > > > >> > > > > > > > >>> > need to fix anyway regardless of GDPR.
>> > > > > > > >> > > > > > > > >>> >
>> > > > > > > >> > > > > > > > >>> >
>> > > > > > > >> > > > > > > > >>> > -- Xiongqi (Wesley) Wu
>> > > > > > > >> > > > > > > > >>> >
>> > > > > > > >> > > > > > > > >>> > On Mon, Aug 13, 2018 at 2:56 PM, Eno
>> > > Thereska
>> > > > <
>> > > > > > > >> > > > > > > > eno.thereska@gmail.com>
>> > > > > > > >> > > > > > > > >>> > wrote:
>> > > > > > > >> > > > > > > > >>> >
>> > > > > > > >> > > > > > > > >>> > > Hello,
>> > > > > > > >> > > > > > > > >>> > >
>> > > > > > > >> > > > > > > > >>> > > Thanks for the KIP. I'd like to see a
>> > more
>> > > > > > precise
>> > > > > > > >> > > > > definition of
>> > > > > > > >> > > > > > > > what
>> > > > > > > >> > > > > > > > >>> > part
>> > > > > > > >> > > > > > > > >>> > > of GDPR you are targeting as well as
>> > some
>> > > > sort
>> > > > > > of
>> > > > > > > >> > > > > verification
>> > > > > > > >> > > > > > > that
>> > > > > > > >> > > > > > > > >>> this
>> > > > > > > >> > > > > > > > >>> > > KIP actually addresses the problem.
>> > Right
>> > > > now I
>> > > > > > > find
>> > > > > > > >> > > this a
>> > > > > > > >> > > > > bit
>> > > > > > > >> > > > > > > > >>> vague:
>> > > > > > > >> > > > > > > > >>> > >
>> > > > > > > >> > > > > > > > >>> > > "Ability to delete a log message
>> through
>> > > > > > > compaction
>> > > > > > > >> in
>> > > > > > > >> > a
>> > > > > > > >> > > > > timely
>> > > > > > > >> > > > > > > > >>> manner
>> > > > > > > >> > > > > > > > >>> > has
>> > > > > > > >> > > > > > > > >>> > > become an important requirement in
>> some
>> > > use
>> > > > > > cases
>> > > > > > > >> > (e.g.,
>> > > > > > > >> > > > > GDPR)"
>> > > > > > > >> > > > > > > > >>> > >
>> > > > > > > >> > > > > > > > >>> > >
>> > > > > > > >> > > > > > > > >>> > > Is there any guarantee that after
>> this
>> > KIP
>> > > > the
>> > > > > > > GDPR
>> > > > > > > >> > > problem
>> > > > > > > >> > > > > is
>> > > > > > > >> > > > > > > > >>> solved or
>> > > > > > > >> > > > > > > > >>> > do
>> > > > > > > >> > > > > > > > >>> > > we need to do something else as well,
>> > > e.g.,
>> > > > more
>> > > > > > > >> KIPs?
>> > > > > > > >> > > > > > > > >>> > >
>> > > > > > > >> > > > > > > > >>> > >
>> > > > > > > >> > > > > > > > >>> > > Thanks
>> > > > > > > >> > > > > > > > >>> > >
>> > > > > > > >> > > > > > > > >>> > > Eno
>> > > > > > > >> > > > > > > > >>> > >
>> > > > > > > >> > > > > > > > >>> > >
>> > > > > > > >> > > > > > > > >>> > >
>> > > > > > > >> > > > > > > > >>> > > On Thu, Aug 9, 2018 at 4:18 PM,
>> xiongqi
>> > > wu <
>> > > > > > > >> > > > > xiongqiwu@gmail.com>
>> > > > > > > >> > > > > > > > >>> wrote:
>> > > > > > > >> > > > > > > > >>> > >
>> > > > > > > >> > > > > > > > >>> > > > Hi Kafka,
>> > > > > > > >> > > > > > > > >>> > > >
>> > > > > > > >> > > > > > > > >>> > > > This KIP tries to address GDPR
>> concern
>> > > to
>> > > > > > > fulfill
>> > > > > > > >> > > > deletion
>> > > > > > > >> > > > > > > > request
>> > > > > > > >> > > > > > > > >>> on
>> > > > > > > >> > > > > > > > >>> > > time
>> > > > > > > >> > > > > > > > >>> > > > through time-based log compaction
>> on a
>> > > > > > > compaction
>> > > > > > > >> > > enabled
>> > > > > > > >> > > > > > > topic:
>> > > > > > > >> > > > > > > > >>> > > >
>> > > > > > > >> > > > > > > > >>> > > >
>> > > > > > > >> > > > > > > > >>> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-354%3A+Time-based+log+compaction+policy
>> > > > > > > >> > > > > > > > >>> > > >
>> > > > > > > >> > > > > > > > >>> > > > Any feedback will be appreciated.
>> > > > > > > >> > > > > > > > >>> > > >
>> > > > > > > >> > > > > > > > >>> > > >
>> > > > > > > >> > > > > > > > >>> > > > Xiongqi (Wesley) Wu
>> > > > > > > >> > > > > > > > >>> > > >
>> > > > > > > >> > > > > > > > >>> > >
>> > > > > > > >> > > > > > > > >>> >
>> > > > > > > >> > > > > > > > >>>
>> > > > > > > >> > > > > > > > >>
>> > > > > > > >> > > > > > > > >>
>> > > > > > > >> > > > > > > > >>
>> > > > > > > >> > > > > > > > >> --
>> > > > > > > >> > > > > > > > >> Xiongqi (Wesley) Wu
>> > > > > > > >> > > > > > > > >>
>> > > > > > > >> > > > > > > > >
>> > > > > > > >> > > > > > > > >
>> > > > > > > >> > > > > > > > >
>> > > > > > > >> > > > > > > > > --
>> > > > > > > >> > > > > > > > > Xiongqi (Wesley) Wu
>> > > > > > > >> > > > > > > > >
>> > > > > > > >> > > > > > > >
>> > > > > > > >> > > > > > > >
>> > > > > > > >> > > > > > > >
>> > > > > > > >> > > > > > > > --
>> > > > > > > >> > > > > > > > Xiongqi (Wesley) Wu
>> > > > > > > >> > > > > > > >
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > > --
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > > Brett Rann
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > > Senior DevOps Engineer
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > > Zendesk International Ltd
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > > 395 Collins Street, Melbourne VIC 3000
>> Australia
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > > Mobile: +61 (0) 418 826 017
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > >
>> > > > > > > >> > > > > >
>> > > > > > > >> > > > > >
>> > > > > > > >> > > > > > --
>> > > > > > > >> > > > > > Xiongqi (Wesley) Wu
>> > > > > > > >> > > > > >
>> > > > > > > >> > > > >
>> > > > > > > >> > > >
>> > > > > > > >> > > >
>> > > > > > > >> > > >
>> > > > > > > >> > > > --
>> > > > > > > >> > > > Xiongqi (Wesley) Wu
>> > > > > > > >> > > >
>> > > > > > > >> > >
>> > > > > > > >> >
>> > > > > > > >> >
>> > > > > > > >> >
>> > > > > > > >> > --
>> > > > > > > >> > Xiongqi (Wesley) Wu
>> > > > > > > >> >
>> > > > > > > >>
>> > > > > > > >>
>> > > > > > > >>
>> > > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > --
>> > > > > > > Xiongqi (Wesley) Wu
>> > > > > > >
>> > > > > >
>> > > > > >
>> > > > > >
>> > > >
>> > >
>> > >
>> > > --
>> > > -Regards,
>> > > Mayuresh R. Gharat
>> > > (862) 250-7125
>> > >
>> >
>>
>

Re: [DISCUSS] KIP-354 Time-based log compaction policy

Posted by xiongqi wu <xi...@gmail.com>.
Hi Dong,

Thank you for your comment.  See my inline comments.
I will update the KIP shortly.

Xiongqi (Wesley) Wu


On Sun, Oct 28, 2018 at 9:17 PM Dong Lin <li...@gmail.com> wrote:

> Hey Xiongqi,
>
> Sorry for late reply. I have some comments below:
>
> 1) As discussed earlier in the email list, if the topic is configured with
> both deletion and compaction, in some cases messages produced a long time
> ago cannot be deleted based on time. This is a valid use-case because we
> actually have topics which are configured with both deletion and compaction
> policies. And we should enforce the semantics of both policies. Solution A
> sounds good. We do not need interface change (e.g. extra config) to enforce
> solution A. All we need is to update implementation so that when broker
> compacts a topic, if the message has timestamp (which is the common case),
> messages that are too old (based on the time-based retention config) will
> be discarded. Since this is a valid issue and it is also related to the
> guarantee of when a message can be deleted, can we include the solution of
> this problem in the KIP?
>
======  This makes sense.  We can use a similar approach to increase the log
start offset.
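For illustration, solution A can be sketched as follows. This is a minimal, runnable sketch in plain Python, not Kafka's actual cleaner code; the function and variable names are invented for this example. While compacting, the cleaner keeps only the latest value per key and additionally discards any record whose timestamp falls outside the time-based retention window.

```python
# Illustrative sketch only: a topic with both "compact" and "delete" cleanup
# policies. The cleaner drops records expired by retention.ms while it is
# already copying segments, so old values cannot outlive the retention window.

def compact_and_expire(records, now_ms, retention_ms):
    """records: list of (key, value, timestamp_ms), oldest first.
    Returns the compacted log: latest value per key, minus expired records."""
    cutoff = now_ms - retention_ms
    latest = {}
    for key, value, ts in records:
        if ts < cutoff:
            continue  # expired by time-based retention; discard during compaction
        latest[key] = (value, ts)  # later values for a key replace earlier ones
    return [(k, v, ts) for k, (v, ts) in latest.items()]

log = [("k1", "old", 0),        # older than the retention window
       ("k2", "v1", 99_000),
       ("k2", "v2", 99_500)]    # supersedes v1
print(compact_and_expire(log, now_ms=100_000, retention_ms=10_000))
# only k2's latest value survives
```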

>
> 2) It is probably OK to assume that all messages have timestamp. The
> per-message timestamp was introduced into Kafka 0.10.0 with KIP-31 and
> KIP-32 as of Feb 2016. Kafka 0.10.0 or earlier versions are no longer
> supported. Also, since the use-case for this feature is primarily for GDPR,
> we can assume that client library has already been upgraded to support SSL,
> which feature is added after KIP-31 and KIP-32.
>
>  =========>  Ok. We can use message timestamp to delete expired records if
both compaction and retention are enabled.


3) In Proposed Change section 2.a, it is said that segment.largestTimestamp
> - maxSegmentMs can be used to determine the timestamp of the earliest
> message. Would it be simpler to just use the create time of the file to
> determine the time?
>
> ========>  Linux/Java doesn't provide a reliable API for file creation time
because some filesystem types don't record it.


> 4) The KIP suggests to use must-clean-ratio to select the partition to be
> compacted. Unlike dirty ratio which is mostly for performance, the logs
> whose "must-clean-ratio" is non-zero must be compacted immediately for
> correctness reasons (and for GDPR). And if this cannot be achieved because
> e.g. broker compaction throughput is too low, investigation will be needed.
> So it seems simpler to first compact logs which has segment whose earliest
> timestamp is earlier than now - max.compaction.lag.ms, instead of defining
> must-clean-ratio and sorting logs based on this value.
>
>
======>  Good suggestion. This can simplify the implementation quite a bit,
provided we are not too concerned about a GDPR-required partition's compaction
being queued behind some large partition.  The actual compaction completion
time is not guaranteed anyway.


> 5) The KIP says max.compaction.lag.ms is 0 by default and it is also
> suggested that 0 means disable. Should we set this value to MAX_LONG by
> default to effectively disable the feature added in this KIP?
>
> ====> I would rather use 0 so the corresponding code path will not be
exercised.  By using MAX_LONG, we would theoretically go through related
code to find out whether the partition is required to be compacted to
satisfy MAX_LONG.
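Under the 0-means-disabled convention, the guard can short-circuit before any per-segment work is done. A toy sketch (the function and parameter names are hypothetical, not from Kafka's code base):

```python
# Toy sketch of "max.compaction.lag.ms == 0 disables the feature": the
# eligibility code path is skipped entirely rather than comparing segment
# ages against a huge sentinel like MAX_LONG.

def needs_forced_compaction(segment_age_ms, max_compaction_lag_ms):
    if max_compaction_lag_ms <= 0:   # 0 (the default) => feature disabled
        return False                 # no per-segment eligibility work at all
    return segment_age_ms >= max_compaction_lag_ms

assert not needs_forced_compaction(10**12, 0)  # disabled: never forced
assert needs_forced_compaction(200, 100)       # enabled and overdue
assert not needs_forced_compaction(50, 100)    # enabled, not yet due
```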

6) It is probably cleaner and more readable not to include in the Public Interface
> section those configs whose meaning is not changed.
>
> ====> I will clean that up.

7) The goal of this KIP is to ensure that a log segment whose earliest
> message is earlier than a given threshold will be compacted. This goal may
> not be achieved if the compaction throughput cannot catch up with the total
> bytes-in-rate for the compacted topics on the broker. Thus we need an easy
> way to tell the operator whether this goal is achieved. If we don't already
> have such a metric, maybe we can include metrics to show 1) the total number
> of log segments (or logs) which need to be immediately compacted as
> determined by max.compaction.lag; and 2) the maximum value of now -
> earliest_time_stamp_of_segment among all segments that needs to be
> compacted.
>
> =======> Good suggestion.  I will update the KIP with these metrics.
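The two proposed metrics can be sketched from per-log earliest segment timestamps as below (the metric names and shapes here are placeholders for illustration, not the names the KIP will standardize):

```python
# Hedged sketch of the two metrics discussed above:
#  1) how many logs currently need immediate compaction, and
#  2) the maximum of (now - earliest_segment_timestamp) among them.

def compaction_lag_metrics(earliest_ts_by_log, now_ms, max_lag_ms):
    """earliest_ts_by_log: log name -> earliest un-compacted segment ts (ms)."""
    ages = {log: now_ms - ts for log, ts in earliest_ts_by_log.items()}
    overdue = [age for age in ages.values() if age > max_lag_ms]
    return {
        "num-logs-needing-compaction": len(overdue),
        "max-compaction-delay-ms": max(overdue, default=0),
    }

metrics = compaction_lag_metrics({"t-0": 999_000, "t-1": 999_950},
                                 now_ms=1_000_000, max_lag_ms=100)
print(metrics)
# {'num-logs-needing-compaction': 1, 'max-compaction-delay-ms': 1000}
```

A non-zero first value, or a second value growing well past `max.compaction.lag.ms`, tells the operator that compaction throughput is not keeping up.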

8) The Performance Impact section suggests users use the existing metrics to
> monitor the performance impact of this KIP. It is useful to list the meaning
> of each JMX metric that we want users to monitor, and possibly explain how to
> interpret the value of these metrics to determine whether there is
> performance issue.
>
> =========>  I will update the KIP.

> Thanks,
> Dong
>
> On Tue, Oct 16, 2018 at 10:53 AM xiongqi wu <xi...@gmail.com> wrote:
>
> > Mayuresh,
> >
> > Thanks for the comments.
> > The requirement is that we need to pick up segments that are older than
> > maxCompactionLagMs for compaction.
> > maxCompactionLagMs is an upper-bound, which implies that picking up
> > segments for compaction earlier doesn't violate the policy.
> > We use the creation time of a segment as an estimation of its records
> > arrival time, so these records can be compacted no later than
> > maxCompactionLagMs.
> >
> > On the other hand, compaction is an expensive operation, so we don't want
> > to compact the log partition whenever a new segment is sealed.
> > Therefore, we want to pick up a segment for compaction when the segment is
> > close to the mandatory max compaction lag (so we use segment creation time
> > as an estimation).
> >
> >
> > Xiongqi (Wesley) Wu
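The eligibility test described above can be sketched as follows (illustrative Python; the identifiers are invented, not Kafka's). Because both the creation-time estimate and the `largestTimestamp - maxSegmentMs` fallback can only under-estimate the earliest record timestamp, a segment may be picked up early but never late:

```python
DAY_MS = 24 * 3600 * 1000

def estimated_earliest_ts(first_record_ts, largest_ts, max_segment_ms):
    """Per the KIP's estimation: use the first record's timestamp when present,
    otherwise fall back to segment.largestTimestamp - maxSegmentMs. The
    fallback is a lower bound on the true earliest timestamp."""
    if first_record_ts is not None:
        return first_record_ts
    return largest_ts - max_segment_ms

def must_compact(first_record_ts, largest_ts, max_segment_ms, now_ms, max_lag_ms):
    # The segment is due once its (estimated) oldest record exceeds the lag.
    est = estimated_earliest_ts(first_record_ts, largest_ts, max_segment_ms)
    return est <= now_ms - max_lag_ms

# With no record timestamps, a segment rolled at day 3 under a one-day
# segment.ms is treated as possibly holding records from day 2 onward:
assert must_compact(None, 3 * DAY_MS, DAY_MS, 5 * DAY_MS, 2 * DAY_MS)
assert not must_compact(None, 3 * DAY_MS, DAY_MS, 4 * DAY_MS - 1, 2 * DAY_MS)
```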
> >
> >
> > On Mon, Oct 15, 2018 at 5:54 PM Mayuresh Gharat <
> > gharatmayuresh15@gmail.com>
> > wrote:
> >
> > > Hi Wesley,
> > >
> > > Thanks for the KIP and sorry for being late to the party.
> > >  I wanted to understand, the scenario you mentioned in Proposed
> changes :
> > >
> > > -
> > > >
> > > > Estimate the earliest message timestamp of an un-compacted log
> segment.
> > > we
> > > > only need to estimate earliest message timestamp for un-compacted log
> > > > segments to ensure timely compaction because the deletion requests
> that
> > > > belong to compacted segments have already been processed.
> > > >
> > > >    1.
> > > >
> > > >    for the first (earliest) log segment:  The estimated earliest
> > > >    timestamp is set to the timestamp of the first message if
> timestamp
> > is
> > > >    present in the message. Otherwise, the estimated earliest
> timestamp
> > > is set
> > > >    to "segment.largestTimestamp - maxSegmentMs”
> > > >     (segment.largestTimestamp is lastModified time of the log segment
> > or
> > > max
> > > >    timestamp we see for the log segment). In the latter case, the
> > actual
> > > >    timestamp of the first message might be later than the estimation,
> > > but it
> > > >    is safe to pick up the log for compaction earlier.
> > > >
> > > > When we say "actual timestamp of the first message might be later
> than
> > > the
> > > estimation, but it is safe to pick up the log for compaction earlier.",
> > > doesn't that violate the assumption that we will consider a segment for
> > > compaction only if the time of creation of the segment has crossed the
> "now
> > -
> > > maxCompactionLagMs" ?
> > >
> > > Thanks,
> > >
> > > Mayuresh
> > >
> > > On Mon, Sep 3, 2018 at 7:28 PM Brett Rann <br...@zendesk.com.invalid>
> > > wrote:
> > >
> > > > Might also be worth moving to a vote thread? Discussion seems to have
> > > gone
> > > > as far as it can.
> > > >
> > > > > On 4 Sep 2018, at 12:08, xiongqi wu <xi...@gmail.com> wrote:
> > > > >
> > > > > Brett,
> > > > >
> > > > > Yes, I will post PR tomorrow.
> > > > >
> > > > > Xiongqi (Wesley) Wu
> > > > >
> > > > >
> > > > > On Sun, Sep 2, 2018 at 6:28 PM Brett Rann
> <brann@zendesk.com.invalid
> > >
> > > > wrote:
> > > > >
> > > > > > +1 (non-binding) from me on the interface. I'd like to see
> someone
> > > > familiar
> > > > > > with
> > > > > > the code comment on the approach, and note there's a couple of
> > > > different
> > > > > > approaches: what's documented in the KIP, and what Xiaohe Dong
> was
> > > > working
> > > > > > on
> > > > > > here:
> > > > > >
> > > > > >
> > > >
> > >
> >
> https://github.com/dongxiaohe/kafka/tree/dongxiaohe/log-cleaner-compaction-max-lifetime-2.0
> > > > > >
> > > > > > If you have code working already Xiongqi Wu could you share a PR?
> > I'd
> > > > be
> > > > > > happy
> > > > > > to start testing.
> > > > > >
> > > > > > On Tue, Aug 28, 2018 at 5:57 AM xiongqi wu <xi...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > > Hi All,
> > > > > > >
> > > > > > > Do you have any additional comments on this KIP?
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Aug 16, 2018 at 9:17 PM, xiongqi wu <
> xiongqiwu@gmail.com
> > >
> > > > wrote:
> > > > > > >
> > > > > > > > on 2)
> > > > > > > > The offsetmap is built starting from dirty segment.
> > > > > > > > The compaction starts from the beginning of the log
> partition.
> > > > That's
> > > > > > how
> > > > > > > > it ensures the deletion of tombstone keys.
> > > > > > > > I will double check tomorrow.
> > > > > > > >
> > > > > > > > Xiongqi (Wesley) Wu
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Aug 16, 2018 at 6:46 PM Brett Rann
> > > > <br...@zendesk.com.invalid>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > >> To just clarify a bit on 1. whether there's an external
> > > storage/DB
> > > > > > isn't
> > > > > > > >> relevant here.
> > > > > > > >> Compacted topics allow a tombstone record to be sent (a null
> > > value
> > > > > > for a
> > > > > > > >> key) which
> > > > > > > >> currently will result in old values for that key being
> deleted
> > > if
> > > > some
> > > > > > > >> conditions are met.
> > > > > > > >> There are existing controls to make sure the old values will
> > > stay
> > > > > > around
> > > > > > > >> for a minimum
> > > > > > > >> time at least, but no dedicated control to ensure the
> > tombstone
> > > > will
> > > > > > > >> delete
> > > > > > > >> within a
> > > > > > > >> maximum time.
> > > > > > > >>
> > > > > > > >> One popular reason that maximum time for deletion is
> desirable
> > > > right
> > > > > > now
> > > > > > > >> is
> > > > > > > >> GDPR with
> > > > > > > >> PII. But we're not proposing any GDPR awareness in kafka,
> just
> > > > being
> > > > > > > able
> > > > > > > >> to guarantee
> > > > > > > >> a max time where a tombstoned key will be removed from the
> > > > compacted
> > > > > > > >> topic.
> > > > > > > >>
> > > > > > > >> on 2)
> > > > > > > >> huh, i thought it kept track of the first dirty segment and
> > > didn't
> > > > > > > >> recompact older "clean" ones.
> > > > > > > >> But I didn't look at code or test for that.
> > > > > > > >>
> > > > > > > >> On Fri, Aug 17, 2018 at 10:57 AM xiongqi wu <
> > > xiongqiwu@gmail.com>
> > > > > > > wrote:
> > > > > > > >>
> > > > > > > >> > 1, Owner of data (in this sense, kafka is not the
> owner
> > of
> > > > data)
> > > > > > > >> > should keep track of lifecycle of the data in some
> external
> > > > > > > storage/DB.
> > > > > > > >> > The owner determines when to delete the data and send the
> > > delete
> > > > > > > >> request to
> > > > > > > >> > kafka. Kafka doesn't know about the content of data but to
> > > > provide a
> > > > > > > >> mean
> > > > > > > >> > for deletion.
> > > > > > > >> >
> > > > > > > >> > 2 , each time compaction runs, it will start from first
> > > > segments (no
> > > > > > > >> > matter if it is compacted or not). The time estimation
> here
> > is
> > > > only
> > > > > > > used
> > > > > > > >> > to determine whether we should run compaction on this log
> > > > partition.
> > > > > > > So
> > > > > > > >> we
> > > > > > > >> > only need to estimate uncompacted segments.
> > > > > > > >> >
> > > > > > > >> > On Thu, Aug 16, 2018 at 5:35 PM, Dong Lin <
> > > lindong28@gmail.com>
> > > > > > > wrote:
> > > > > > > >> >
> > > > > > > >> > > Hey Xiongqi,
> > > > > > > >> > >
> > > > > > > >> > > Thanks for the update. I have two questions for the
> latest
> > > > KIP.
> > > > > > > >> > >
> > > > > > > >> > > 1) The motivation section says that one use case is to
> > > delete
> > > > PII
> > > > > > > >> > (Personal
> > > > > > > >> > > Identifiable information) data within 7 days while
> keeping
> > > > non-PII
> > > > > > > >> > > indefinitely in compacted format. I suppose the use-case
> > > > depends
> > > > > > on
> > > > > > > >> the
> > > > > > > >> > > application to determine when to delete those PII data.
> > > Could
> > > > you
> > > > > > > >> explain
> > > > > > > >> > > how can application reliably determine the set of keys
> > that
> > > > should
> > > > > > > be
> > > > > > > >> > > deleted? Is the application required to always read messages from
> > the
> > > > topic
> > > > > > > >> after
> > > > > > > >> > > every restart and determine the keys to be deleted by
> > > looking
> > > > at
> > > > > > > >> message
> > > > > > > >> > > timestamp, or is application supposed to persist the
> key->
> > > > > > timestamp
> > > > > > > >> > > information in a separate persistent storage system?
> > > > > > > >> > >
> > > > > > > >> > > 2) It is mentioned in the KIP that "we only need to
> > estimate
> > > > > > > earliest
> > > > > > > >> > > message timestamp for un-compacted log segments because
> > the
> > > > > > deletion
> > > > > > > >> > > requests that belong to compacted segments have already
> > been
> > > > > > > >> processed".
> > > > > > > >> > > Not sure if it is correct. If a segment is compacted
> > before
> > > > user
> > > > > > > sends
> > > > > > > >> > > message to delete a key in this segment, it seems that
> we
> > > > still
> > > > > > need
> > > > > > > >> to
> > > > > > > >> > > ensure that the segment will be compacted again within
> the
> > > > given
> > > > > > > time
> > > > > > > >> > after
> > > > > > > >> > > the deletion is requested, right?
> > > > > > > >> > >
> > > > > > > >> > > Thanks,
> > > > > > > >> > > Dong
> > > > > > > >> > >
> > > > > > > >> > > On Thu, Aug 16, 2018 at 10:27 AM, xiongqi wu <
> > > > xiongqiwu@gmail.com
> > > > > > >
> > > > > > > >> > wrote:
> > > > > > > >> > >
> > > > > > > >> > > > Hi Xiaohe,
> > > > > > > >> > > >
> > > > > > > >> > > > Quick note:
> > > > > > > >> > > > 1) Use minimum of segment.ms and max.compaction.lag.ms
> > > > > > > >> > > >
> > > > > > > >> > > > 2) I am not sure if I get your second question. first,
> > we
> > > > have
> > > > > > > >> jitter
> > > > > > > >> > > when
> > > > > > > >> > > > we roll the active segment. second, on each
> compaction,
> > we
> > > > > > compact
> > > > > > > >> upto
> > > > > > > >> > > > the offsetmap could allow. Those will not lead to
> > perfect
> > > > > > > compaction
> > > > > > > >> > > storm
> > > > > > > >> > > > overtime. In addition, I expect we are setting
> > > > > > > >> max.compaction.lag.ms
> > > > > > > >> > on
> > > > > > > >> > > > the order of days.
> > > > > > > >> > > >
> > > > > > > >> > > > 3) I don't have access to the confluent community
> slack
> > > for
> > > > > > now. I
> > > > > > > >> am
> > > > > > > >> > > > reachable via the google handle out.
> > > > > > > >> > > > To avoid the double effort, here is my plan:
> > > > > > > >> > > > a) Collect more feedback and feature requirements on
> the
> > > KIP.
> > > > > > > >> > > > b) Wait until this KIP is approved.
> > > > > > > >> > > > c) I will address any additional requirements in the
> > > > > > > implementation.
> > > > > > > >> > (My
> > > > > > > >> > > > current implementation only complies with whatever is
> > described
> > > > in
> > > > > > the
> > > > > > > >> KIP
> > > > > > > >> > > now)
> > > > > > > >> > > > d) I can share the code with you and the community to see if
> > you
> > > > want
> > > > > > to
> > > > > > > >> add
> > > > > > > >> > > > anything.
> > > > > > > >> > > > e) submission through committee
> > > > > > > >> > > >
> > > > > > > >> > > >
> > > > > > > >> > > > On Wed, Aug 15, 2018 at 11:42 PM, XIAOHE DONG <
> > > > > > > >> dannyrivclo@gmail.com>
> > > > > > > >> > > > wrote:
> > > > > > > >> > > >
> > > > > > > >> > > > > Hi Xiongqi
> > > > > > > >> > > > >
> > > > > > > >> > > > > Thanks for thinking about implementing this as well.
> > :)
> > > > > > > >> > > > >
> > > > > > > >> > > > > I was thinking about using `segment.ms` to trigger
> > the
> > > > > > segment
> > > > > > > >> roll.
> > > > > > > >> > > > > Also, its value can be the largest time bias for the
> > > > record
> > > > > > > >> deletion.
> > > > > > > >> > > For
> > > > > > > >> > > > > example, if the `segment.ms` is 1 day and `
> > > > max.compaction.ms`
> > > > > > > is
> > > > > > > >> 30
> > > > > > > >> > > > days,
> > > > > > > >> > > > > the compaction may happen around 31 days.
> > > > > > > >> > > > >
> > > > > > > >> > > > > For my curiosity, is there a way we can do some
> > > > performance
> > > > > > test
> > > > > > > >> for
> > > > > > > >> > > this
> > > > > > > >> > > > > and any tools you can recommend. As you know,
> > > previously,
> > > > it
> > > > > > is
> > > > > > > >> > cleaned
> > > > > > > >> > > > up
> > > > > > > >> > > > > by respecting dirty ratio, but now it may happen
> > anytime
> > > > if
> > > > > > max
> > > > > > > >> lag
> > > > > > > >> > has
> > > > > > > >> > > > > passed for each message. I wonder what would happen
> if
> > > > clients
> > > > > > > >> send
> > > > > > > >> > > huge
> > > > > > > >> > > > > amount of tombstone records at the same time.
> > > > > > > >> > > > >
> > > > > > > >> > > > > I am looking forward to have a quick chat with you
> to
> > > > avoid
> > > > > > > double
> > > > > > > >> > > effort
> > > > > > > >> > > > > on this. I am in confluent community slack during
> the
> > > work
> > > > > > time.
> > > > > > > >> My
> > > > > > > >> > > name
> > > > > > > >> > > > is
> > > > > > > >> > > > > Xiaohe Dong. :)
> > > > > > > >> > > > >
> > > > > > > >> > > > > Rgds
> > > > > > > >> > > > > Xiaohe Dong
> > > > > > > >> > > > >
> > > > > > > >> > > > >
> > > > > > > >> > > > >
> > > > > > > >> > > > > On 2018/08/16 01:22:22, xiongqi wu <
> > xiongqiwu@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > >> > > > > > Brett,
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > Thank you for your comments.
> > > > > > > >> > > > > > I was thinking since we already has immediate
> > > compaction
> > > > > > > >> setting by
> > > > > > > >> > > > > setting
> > > > > > > >> > > > > > min dirty ratio to 0, so I decide to use "0" as
> > > disabled
> > > > > > > state.
> > > > > > > >> > > > > > I am ok to go with -1(disable), 0 (immediate)
> > options.
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > For the implementation, there are a few
> differences
> > > > between
> > > > > > > mine
> > > > > > > >> > and
> > > > > > > >> > > > > > "Xiaohe Dong"'s :
> > > > > > > >> > > > > > 1) I used the estimated creation time of a log
> > segment
> > > > > > instead
> > > > > > > >> of
> > > > > > > >> > > > largest
> > > > > > > >> > > > > > timestamp of a log to determine the compaction
> > > > eligibility,
> > > > > > > >> > because a
> > > > > > > >> > > > log
> > > > > > > >> > > > > > segment might stay as an active segment up to "max
> > > > > > compaction
> > > > > > > >> lag".
> > > > > > > >> > > > (see
> > > > > > > >> > > > > > the KIP for detail).
> > > > > > > >> > > > > > 2) I measure how many bytes we must clean to
> > > > follow the
> > > > > > > >> "max
> > > > > > > >> > > > > > compaction lag" rule, and use that to determine
> the
> > > > order of
> > > > > > > >> > > > compaction.
> > > > > > > >> > > > > > 3) force active segment to roll to follow the "max
> > > > > > compaction
> > > > > > > >> lag"
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > I can share my code so we can coordinate.
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > I haven't think about a new API to force a
> > compaction.
> > > > what
> > > > > > is
> > > > > > > >> the
> > > > > > > >> > > use
> > > > > > > >> > > > > case
> > > > > > > >> > > > > > for this one?
> > > > > > > >> > > > > >
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > On Wed, Aug 15, 2018 at 5:33 PM, Brett Rann
> > > > > > > >> > > <brann@zendesk.com.invalid
> > > > > > > >> > > > >
> > > > > > > >> > > > > > wrote:
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > > We've been looking into this too.
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > Mailing list:
> > > > > > > >> > > > > > > https://lists.apache.org/thread.html/ed7f6a6589f94e8c2a705553f364ef599cb6915e4c3ba9b561e610e4@%3Cdev.kafka.apache.org%3E
> > > > > > > >> > > > > > > jira wish: https://issues.apache.org/jira/browse/KAFKA-7137
> > > > > > > >> > > > > > > confluent slack discussion:
> > > > > > > >> > > > > > > https://confluentcommunity.slack.com/archives/C49R61XMM/p1530760121000039
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > A person on my team has started on code so you
> > might
> > > > want
> > > > > > to
> > > > > > > >> > > > > coordinate:
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > https://github.com/dongxiaohe/kafka/tree/dongxiaohe/log-cleaner-compaction-max-lifetime-2.0
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > He's been working with Jason Gustafson and James
> > > Chen
> > > > > > around
> > > > > > > >> the
> > > > > > > >> > > > > changes.
> > > > > > > >> > > > > > > You can ping him on confluent slack as Xiaohe
> > Dong.
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > It's great to know others are thinking on it as
> > > well.
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > You've added the requirement to force a segment
> > roll
> > > > which
> > > > > > > we
> > > > > > > >> > > hadn't
> > > > > > > >> > > > > gotten
> > > > > > > >> > > > > > > to yet, which is great. I was content with it
> not
> > > > > > including
> > > > > > > >> the
> > > > > > > >> > > > active
> > > > > > > >> > > > > > > segment.
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > > Adding topic level configuration "
> > > > max.compaction.lag.ms
> > > > > > ",
> > > > > > > >> and
> > > > > > > >> > > > > > > corresponding broker configuration "
> > > > > > > >> > log.cleaner.max.compaction.la
> > > > > > > >> > > > g.ms
> > > > > > > >> > > > > ",
> > > > > > > >> > > > > > > which is set to 0 (disabled) by default.
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > Glancing at some other settings convention seems
> > to
> > > > me to
> > > > > > be
> > > > > > > >> -1
> > > > > > > >> > for
> > > > > > > >> > > > > > > disabled (or infinite, which is more meaningful
> > > > here). 0
> > > > > > to
> > > > > > > me
> > > > > > > >> > > > implies
> > > > > > > >> > > > > > > instant, a little quicker than 1.
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > We've been trying to think about a way to
> trigger
> > > > > > compaction
> > > > > > > >> as
> > > > > > > >> > > well
> > > > > > > >> > > > > > > through an API call, which would need to be
> > flagged
> > > > > > > somewhere
> > > > > > > >> (ZK
> > > > > > > >> > > > > admin/
> > > > > > > >> > > > > > > space?) but we're struggling to think how that
> > would
> > > > be
> > > > > > > >> > coordinated
> > > > > > > >> > > > > across
> > > > > > > >> > > > > > > brokers and partitions. Have you given any
> thought
> > > to
> > > > > > that?
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > On Thu, Aug 16, 2018 at 8:44 AM xiongqi wu <
> > > > > > > >> xiongqiwu@gmail.com>
> > > > > > > >> > > > > wrote:
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > > Eno, Dong,
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > > I have updated the KIP. We decided not to
> address
> > > the
> > > > > > issue
> > > > > > > >> that
> > > > > > > >> > > we
> > > > > > > >> > > > > might
> > > > > > > >> > > > > > > > have for both compaction and time retention
> > > enabled
> > > > > > topics
> > > > > > > >> (see
> > > > > > > >> > > the
> > > > > > > >> > > > > > > > rejected alternative item 2). This KIP will
> only
> > > > ensure
> > > > > > > log
> > > > > > > >> can
> > > > > > > >> > > be
> > > > > > > >> > > > > > > > compacted after a specified time-interval.
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > > As suggested by Dong, we will also enforce "
> > > > > > > >> > > max.compaction.lag.ms"
> > > > > > > >> > > > > is
> > > > > > > >> > > > > > > not
> > > > > > > >> > > > > > > > less than "min.compaction.lag.ms".
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > >
> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-354
> > > > > > > >> > > > > > > > Time-based log compaction policy
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > > On Tue, Aug 14, 2018 at 5:01 PM, xiongqi wu <
> > > > > > > >> > xiongqiwu@gmail.com
> > > > > > > >> > > >
> > > > > > > >> > > > > wrote:
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > Per discussion with Dong, he made a very
> good
> > > > point
> > > > > > that
> > > > > > > >> if
> > > > > > > >> > > > > compaction
> > > > > > > >> > > > > > > > > and time based retention are both enabled
> on a
> > > > topic,
> > > > > > > the
> > > > > > > >> > > > > compaction
> > > > > > > >> > > > > > > > might
> > > > > > > >> > > > > > > > > prevent records from being deleted on time.
> > The
> > > > reason
> > > > > > > is
> > > > > > > >> > when
> > > > > > > >> > > > > > > compacting
> > > > > > > >> > > > > > > > > multiple segments into one single segment,
> the
> > > > newly
> > > > > > > >> created
> > > > > > > >> > > > > segment
> > > > > > > >> > > > > > > will
> > > > > > > >> > > > > > > > > have same lastmodified timestamp as latest
> > > > original
> > > > > > > >> segment.
> > > > > > > >> > We
> > > > > > > >> > > > > lose
> > > > > > > >> > > > > > > the
> > > > > > > >> > > > > > > > > timestamp of all original segments except
> the
> > > last
> > > > > > one.
> > > > > > > >> As a
> > > > > > > >> > > > > result,
> > > > > > > >> > > > > > > > > records might not be deleted as it should be
> > > > through
> > > > > > > time
> > > > > > > >> > based
> > > > > > > >> > > > > > > > retention.
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > With the current KIP proposal, if we want to
> > > > ensure
> > > > > > > timely
> > > > > > > >> > > > > deletion, we
> > > > > > > >> > > > > > > > > have the following configurations:
> > > > > > > >> > > > > > > > > 1) enable time based log compaction only :
> > > > deletion is
> > > > > > > >> done
> > > > > > > >> > > > though
> > > > > > > >> > > > > > > > > overriding the same key
> > > > > > > >> > > > > > > > > 2) enable time based log retention only:
> > > deletion
> > > > is
> > > > > > > done
> > > > > > > >> > > though
> > > > > > > >> > > > > > > > > time-based retention
> > > > > > > >> > > > > > > > > 3) enable both log compaction and time based
> > > > > > retention:
> > > > > > > >> > > Deletion
> > > > > > > >> > > > > is not
> > > > > > > >> > > > > > > > > guaranteed.
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > Not sure if we have use case 3 and also want
> > > > deletion
> > > > > > to
> > > > > > > >> > happen
> > > > > > > >> > > > on
> > > > > > > >> > > > > > > time.
> > > > > > > >> > > > > > > > > There are several options to address
> deletion
> > > > issue
> > > > > > when
> > > > > > > >> > enable
> > > > > > > >> > > > > both
> > > > > > > >> > > > > > > > > compaction and retention:
> > > > > > > >> > > > > > > > > A) During log compaction, looking into
> record
> > > > > > timestamp
> > > > > > > to
> > > > > > > >> > > delete
> > > > > > > >> > > > > > > expired
> > > > > > > >> > > > > > > > > records. This can be done in compaction
> logic
> > > > itself
> > > > > > or
> > > > > > > >> use
> > > > > > > >> > > > > > > > > AdminClient.deleteRecords() . But this
> assumes
> > > we
> > > > have
> > > > > > > >> record
> > > > > > > >> > > > > > > timestamp.
> > > > > > > >> > > > > > > > > B) retain the lastModifed time of original
> > > > segments
> > > > > > > during
> > > > > > > >> > log
> > > > > > > >> > > > > > > > compaction.
> > > > > > > >> > > > > > > > > This requires extra meta data to record the
> > > > > > information
> > > > > > > or
> > > > > > > >> > not
> > > > > > > >> > > > > grouping
> > > > > > > >> > > > > > > > > multiple segments into one during
> compaction.
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > If we have use case 3 in general, I would
> > prefer
> > > > > > > solution
> > > > > > > >> A
> > > > > > > >> > and
> > > > > > > >> > > > > rely on
> > > > > > > >> > > > > > > > > record timestamp.
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > Two questions:
> > > > > > > >> > > > > > > > > Do we have use case 3? Is it nice to have or
> > > must
> > > > > > have?
> > > > > > > >> > > > > > > > > If we have use case 3 and want to go with
> > > > solution A,
> > > > > > > >> should
> > > > > > > >> > we
> > > > > > > >> > > > > > > introduce
> > > > > > > >> > > > > > > > > a new configuration to enforce deletion by
> > > > timestamp?
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > On Tue, Aug 14, 2018 at 1:52 PM, xiongqi wu
> <
> > > > > > > >> > > xiongqiwu@gmail.com
> > > > > > > >> > > > >
> > > > > > > >> > > > > > > wrote:
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > >> Dong,
> > > > > > > >> > > > > > > > >>
> > > > > > > >> > > > > > > > >> Thanks for the comment.
> > > > > > > >> > > > > > > > >>
> > > > > > > >> > > > > > > > >> There are two retention policy: log
> > compaction
> > > > and
> > > > > > time
> > > > > > > >> > based
> > > > > > > >> > > > > > > retention.
> > > > > > > >> > > > > > > > >>
> > > > > > > >> > > > > > > > >> Log compaction:
> > > > > > > >> > > > > > > > >>
> > > > > > > >> > > > > > > > >> we have use cases to keep infinite
> retention
> > > of a
> > > > > > topic
> > > > > > > >> > (only
> > > > > > > >> > > > > > > > >> compaction). GDPR cares about deletion of
> PII
> > > > > > (personal
> > > > > > > >> > > > > identifiable
> > > > > > > >> > > > > > > > >> information) data.
> > > > > > > >> > > > > > > > >> Since Kafka doesn't know what records
> contain
> > > > PII, it
> > > > > > > >> relies
> > > > > > > >> > > on
> > > > > > > >> > > > > upper
> > > > > > > >> > > > > > > > >> layer to delete those records.
> > > > > > > >> > > > > > > > >> For those infinite retention uses uses,
> kafka
> > > > needs
> > > > > > to
> > > > > > > >> > > provide a
> > > > > > > >> > > > > way
> > > > > > > >> > > > > > > to
> > > > > > > >> > > > > > > > >> enforce compaction on time. This is what we
> > try
> > > > to
> > > > > > > >> address
> > > > > > > >> > in
> > > > > > > >> > > > this
> > > > > > > >> > > > > > > KIP.
> > > > > > > >> > > > > > > > >>
> > > > > > > >> > > > > > > > >> Time based retention,
> > > > > > > >> > > > > > > > >>
> > > > > > > >> > > > > > > > >> There are also use cases that users of
> Kafka
> > > > might
> > > > > > want
> > > > > > > >> to
> > > > > > > >> > > > expire
> > > > > > > >> > > > > all
> > > > > > > >> > > > > > > > >> their data.
> > > > > > > >> > > > > > > > >> In those cases, they can use time based
> > > > retention of
> > > > > > > >> their
> > > > > > > >> > > > topics.
> > > > > > > >> > > > > > > > >>
> > > > > > > >> > > > > > > > >>
> > > > > > > >> > > > > > > > >> Regarding your first question, if a user
> > wants
> > > to
> > > > > > > delete
> > > > > > > >> a
> > > > > > > >> > key
> > > > > > > >> > > > in
> > > > > > > >> > > > > the
> > > > > > > >> > > > > > > > >> log compaction topic, the user has to send
> a
> > > > deletion
> > > > > > > >> using
> > > > > > > >> > > the
> > > > > > > >> > > > > same
> > > > > > > >> > > > > > > > key.
> > > > > > > >> > > > > > > > >> Kafka only makes sure the deletion will
> > happen
> > > > under
> > > > > > a
> > > > > > > >> > certain
> > > > > > > >> > > > > time
> > > > > > > >> > > > > > > > >> periods (like 2 days/7 days).
> > > > > > > >> > > > > > > > >>
> > > > > > > >> > > > > > > > >> Regarding your second question. In most
> > cases,
> > > we
> > > > > > might
> > > > > > > >> want
> > > > > > > >> > > to
> > > > > > > >> > > > > delete
> > > > > > > >> > > > > > > > >> all duplicated keys at the same time.
> > > > > > > >> > > > > > > > >> Compaction might be more efficient since we
> > > need
> > > > to
> > > > > > > scan
> > > > > > > >> the
> > > > > > > >> > > log
> > > > > > > >> > > > > and
> > > > > > > >> > > > > > > > find
> > > > > > > >> > > > > > > > >> all duplicates. However, the expected use
> > case
> > > > is to
> > > > > > > set
> > > > > > > >> the
> > > > > > > >> > > > time
> > > > > > > >> > > > > > > based
> > > > > > > >> > > > > > > > >> compaction interval on the order of days,
> and
> > > be
> > > > > > larger
> > > > > > > >> than
> > > > > > > >> > > > 'min
> > > > > > > >> > > > > > > > >> compaction lag". We don't want log
> compaction
> > > to
> > > > > > happen
> > > > > > > >> > > > frequently
> > > > > > > >> > > > > > > since
> > > > > > > >> > > > > > > > >> it is expensive. The purpose is to help low
> > > > > > production
> > > > > > > >> rate
> > > > > > > >> > > > topic
> > > > > > > >> > > > > to
> > > > > > > >> > > > > > > get
> > > > > > > >> > > > > > > > >> compacted on time. For the topic with
> > "normal"
> > > > > > incoming
> > > > > > > >> > > message
> > > > > > > >> > > > > > > message
> > > > > > > >> > > > > > > > >> rate, the "min dirty ratio" might have
> > > triggered
> > > > the
> > > > > > > >> > > compaction
> > > > > > > >> > > > > before
> > > > > > > >> > > > > > > > this
> > > > > > > >> > > > > > > > >> time based compaction policy takes effect.
> > > > > > > >> > > > > > > > >>
> > > > > > > >> > > > > > > > >>
> > > > > > > >> > > > > > > > >> Eno,
> > > > > > > >> > > > > > > > >>
> > > > > > > >> > > > > > > > >> For your question, like I mentioned we have
> > > long
> > > > time
> > > > > > > >> > > retention
> > > > > > > >> > > > > use
> > > > > > > >> > > > > > > case
> > > > > > > >> > > > > > > > >> for log compacted topic, but we want to
> > provide
> > > > > > ability
> > > > > > > >> to
> > > > > > > >> > > > delete
> > > > > > > >> > > > > > > > certain
> > > > > > > >> > > > > > > > >> PII records on time.
> > > > > > > >> > > > > > > > >> Kafka itself doesn't know whether a record
> > > > contains
> > > > > > > >> > sensitive
> > > > > > > >> > > > > > > > information
> > > > > > > >> > > > > > > > >> and relies on the user for deletion.
> > > > > > > >> > > > > > > > >>
> > > > > > > >> > > > > > > > >>
> > > > > > > >> > > > > > > > >> On Mon, Aug 13, 2018 at 6:58 PM, Dong Lin <
> > > > > > > >> > > lindong28@gmail.com>
> > > > > > > >> > > > > > > wrote:
> > > > > > > >> > > > > > > > >>
> > > > > > > >> > > > > > > > >>> Hey Xiongqi,
> > > > > > > >> > > > > > > > >>>
> > > > > > > >> > > > > > > > >>> Thanks for the KIP. I have two questions
> > > > regarding
> > > > > > the
> > > > > > > >> > > use-case
> > > > > > > >> > > > > for
> > > > > > > >> > > > > > > > >>> meeting
> > > > > > > >> > > > > > > > >>> GDPR requirement.
> > > > > > > >> > > > > > > > >>>
> > > > > > > >> > > > > > > > >>> 1) If I recall correctly, one of the GDPR
> > > > > > requirement
> > > > > > > is
> > > > > > > >> > that
> > > > > > > >> > > > we
> > > > > > > >> > > > > can
> > > > > > > >> > > > > > > > not
> > > > > > > >> > > > > > > > >>> keep messages longer than e.g. 30 days in
> > > > storage
> > > > > > > (e.g.
> > > > > > > >> > > Kafka).
> > > > > > > >> > > > > Say
> > > > > > > >> > > > > > > > there
> > > > > > > >> > > > > > > > >>> exists a partition p0 which contains
> > message1
> > > > with
> > > > > > > key1
> > > > > > > >> and
> > > > > > > >> > > > > message2
> > > > > > > >> > > > > > > > with
> > > > > > > >> > > > > > > > >>> key2. And then user keeps producing
> messages
> > > > with
> > > > > > > >> key=key2
> > > > > > > >> > to
> > > > > > > >> > > > > this
> > > > > > > >> > > > > > > > >>> partition. Since message1 with key1 is
> never
> > > > > > > overridden,
> > > > > > > >> > > sooner
> > > > > > > >> > > > > or
> > > > > > > >> > > > > > > > later
> > > > > > > >> > > > > > > > >>> we
> > > > > > > >> > > > > > > > >>> will want to delete message1 and keep the
> > > latest
> > > > > > > message
> > > > > > > >> > with
> > > > > > > >> > > > > > > key=key2.
> > > > > > > >> > > > > > > > >>> But
> > > > > > > >> > > > > > > > >>> currently it looks like log compact logic
> in
> > > > Kafka
> > > > > > > will
> > > > > > > >> > > always
> > > > > > > >> > > > > put
> > > > > > > >> > > > > > > > these
> > > > > > > >> > > > > > > > >>> messages in the same segment. Will this be
> > an
> > > > issue?
> > > > > > > >> > > > > > > > >>>
> > > > > > > >> > > > > > > > >>> 2) The current KIP intends to provide the
> > > > capability
> > > > > > > to
> > > > > > > >> > > delete
> > > > > > > >> > > > a
> > > > > > > >> > > > > > > given
> > > > > > > >> > > > > > > > >>> message in log compacted topic. Does such
> > > > use-case
> > > > > > > also
> > > > > > > >> > > require
> > > > > > > >> > > > > Kafka
> > > > > > > >> > > > > > > > to
> > > > > > > >> > > > > > > > >>> keep the messages produced before the
> given
> > > > message?
> > > > > > > If
> > > > > > > >> > yes,
> > > > > > > >> > > > > then we
> > > > > > > >> > > > > > > > can
> > > > > > > >> > > > > > > > >>> probably just use
> > AdminClient.deleteRecords()
> > > or
> > > > > > > >> time-based
> > > > > > > >> > > log
> > > > > > > >> > > > > > > > retention
> > > > > > > >> > > > > > > > >>> to meet the use-case requirement. If no,
> do
> > > you
> > > > know
> > > > > > > >> what
> > > > > > > >> > is
> > > > > > > >> > > > the
> > > > > > > >> > > > > > > GDPR's
> > > > > > > >> > > > > > > > >>> requirement on time-to-deletion after user
> > > > > > explicitly
> > > > > > > >> > > requests
> > > > > > > >> > > > > the
> > > > > > > >> > > > > > > > >>> deletion
> > > > > > > >> > > > > > > > >>> (e.g. 1 hour, 1 day, 7 day)?
> > > > > > > >> > > > > > > > >>>
> > > > > > > >> > > > > > > > >>> Thanks,
> > > > > > > >> > > > > > > > >>> Dong
> > > > > > > >> > > > > > > > >>>
> > > > > > > >> > > > > > > > >>>
> > > > > > > >> > > > > > > > >>> On Mon, Aug 13, 2018 at 3:44 PM, xiongqi
> wu
> > <
> > > > > > > >> > > > xiongqiwu@gmail.com
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > > > wrote:
> > > > > > > >> > > > > > > > >>>
> > > > > > > >> > > > > > > > >>> > Hi Eno,
> > > > > > > >> > > > > > > > >>> >
> > > > > > > >> > > > > > > > >>> > The GDPR request we are getting here at
> > > > linkedin
> > > > > > is
> > > > > > > >> if we
> > > > > > > >> > > > get a
> > > > > > > >> > > > > > > > >>> request to
> > > > > > > >> > > > > > > > >>> > delete a record through a null key on a
> > log
> > > > > > > compacted
> > > > > > > >> > > topic,
> > > > > > > >> > > > > > > > >>> > we want to delete the record via
> > compaction
> > > > in a
> > > > > > > given
> > > > > > > >> > time
> > > > > > > >> > > > > period
> > > > > > > >> > > > > > > > >>> like 2
> > > > > > > >> > > > > > > > >>> > days (whatever is required by the
> policy).
> > > > > > > >> > > > > > > > >>> >
> > > > > > > >> > > > > > > > >>> > There might be other issues (such as
> > orphan
> > > > log
> > > > > > > >> segments
> > > > > > > >> > > > under
> > > > > > > >> > > > > > > > certain
> > > > > > > >> > > > > > > > >>> > conditions) that lead to GDPR problem
> but
> > > > they are
> > > > > > > >> more
> > > > > > > >> > > like
> > > > > > > >> > > > > > > > >>> something we
> > > > > > > >> > > > > > > > >>> > need to fix anyway regardless of GDPR.
> > > > > > > >> > > > > > > > >>> >
> > > > > > > >> > > > > > > > >>> >
> > > > > > > >> > > > > > > > >>> > -- Xiongqi (Wesley) Wu
> > > > > > > >> > > > > > > > >>> >
> > > > > > > >> > > > > > > > >>> > On Mon, Aug 13, 2018 at 2:56 PM, Eno
> > > Thereska
> > > > <
> > > > > > > >> > > > > > > > eno.thereska@gmail.com>
> > > > > > > >> > > > > > > > >>> > wrote:
> > > > > > > >> > > > > > > > >>> >
> > > > > > > >> > > > > > > > >>> > > Hello,
> > > > > > > >> > > > > > > > >>> > >
> > > > > > > >> > > > > > > > >>> > > Thanks for the KIP. I'd like to see a
> > more
> > > > > > precise
> > > > > > > >> > > > > definition of
> > > > > > > >> > > > > > > > what
> > > > > > > >> > > > > > > > >>> > part
> > > > > > > >> > > > > > > > >>> > > of GDPR you are targeting as well as
> > some
> > > > sort
> > > > > > of
> > > > > > > >> > > > > verification
> > > > > > > >> > > > > > > that
> > > > > > > >> > > > > > > > >>> this
> > > > > > > >> > > > > > > > >>> > > KIP actually addresses the problem.
> > Right
> > > > now I
> > > > > > > find
> > > > > > > >> > > this a
> > > > > > > >> > > > > bit
> > > > > > > >> > > > > > > > >>> vague:
> > > > > > > >> > > > > > > > >>> > >
> > > > > > > >> > > > > > > > >>> > > "Ability to delete a log message
> through
> > > > > > > compaction
> > > > > > > >> in
> > > > > > > >> > a
> > > > > > > >> > > > > timely
> > > > > > > >> > > > > > > > >>> manner
> > > > > > > >> > > > > > > > >>> > has
> > > > > > > >> > > > > > > > >>> > > become an important requirement in
> some
> > > use
> > > > > > cases
> > > > > > > >> > (e.g.,
> > > > > > > >> > > > > GDPR)"
> > > > > > > >> > > > > > > > >>> > >
> > > > > > > >> > > > > > > > >>> > >
> > > > > > > >> > > > > > > > >>> > > Is there any guarantee that after this
> > KIP
> > > > the
> > > > > > > GDPR
> > > > > > > >> > > problem
> > > > > > > >> > > > > is
> > > > > > > >> > > > > > > > >>> solved or
> > > > > > > >> > > > > > > > >>> > do
> > > > > > > >> > > > > > > > >>> > > we need to do something else as well,
> > > e.g.,
> > > > more
> > > > > > > >> KIPs?
> > > > > > > >> > > > > > > > >>> > >
> > > > > > > >> > > > > > > > >>> > >
> > > > > > > >> > > > > > > > >>> > > Thanks
> > > > > > > >> > > > > > > > >>> > >
> > > > > > > >> > > > > > > > >>> > > Eno
> > > > > > > >> > > > > > > > >>> > >
> > > > > > > >> > > > > > > > >>> > >
> > > > > > > >> > > > > > > > >>> > >
> > > > > > > >> > > > > > > > >>> > > On Thu, Aug 9, 2018 at 4:18 PM,
> xiongqi
> > > wu <
> > > > > > > >> > > > > xiongqiwu@gmail.com>
> > > > > > > >> > > > > > > > >>> wrote:
> > > > > > > >> > > > > > > > >>> > >
> > > > > > > >> > > > > > > > >>> > > > Hi Kafka,
> > > > > > > >> > > > > > > > >>> > > >
> > > > > > > >> > > > > > > > >>> > > > This KIP tries to address GDPR
> concern
> > > to
> > > > > > > fulfill
> > > > > > > >> > > > deletion
> > > > > > > >> > > > > > > > request
> > > > > > > >> > > > > > > > >>> on
> > > > > > > >> > > > > > > > >>> > > time
> > > > > > > >> > > > > > > > >>> > > > through time-based log compaction
> on a
> > > > > > > compaction
> > > > > > > >> > > enabled
> > > > > > > >> > > > > > > topic:
> > > > > > > >> > > > > > > > >>> > > >
> > > > > > > >> > > > > > > > >>> > > >
> > > > > > > >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-354%3A+Time-based+log+compaction+policy
> > > > > > > >> > > > > > > > >>> > > >
> > > > > > > >> > > > > > > > >>> > > > Any feedback will be appreciated.
> > > > > > > >> > > > > > > > >>> > > >
> > > > > > > >> > > > > > > > >>> > > >
> > > > > > > >> > > > > > > > >>> > > > Xiongqi (Wesley) Wu
> > > > > > > >> > > > > > > > >>> > > >
> > > > > > > >> > > > > > > > >>> > >
> > > > > > > >> > > > > > > > >>> >
> > > > > > > >> > > > > > > > >>>
> > > > > > > >> > > > > > > > >>
> > > > > > > >> > > > > > > > >>
> > > > > > > >> > > > > > > > >>
> > > > > > > >> > > > > > > > >> --
> > > > > > > >> > > > > > > > >> Xiongqi (Wesley) Wu
> > > > > > > >> > > > > > > > >>
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > --
> > > > > > > >> > > > > > > > > Xiongqi (Wesley) Wu
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > > --
> > > > > > > >> > > > > > > > Xiongqi (Wesley) Wu
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > --
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > Brett Rann
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > Senior DevOps Engineer
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > Zendesk International Ltd
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > 395 Collins Street, Melbourne VIC 3000 Australia
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > Mobile: +61 (0) 418 826 017
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > >
> > > > > > > >> > > > > >
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > --
> > > > > > > >> > > > > > Xiongqi (Wesley) Wu
> > > > > > > >> > > > > >
> > > > > > > >> > > > >
> > > > > > > >> > > >
> > > > > > > >> > > >
> > > > > > > >> > > >
> > > > > > > >> > > > --
> > > > > > > >> > > > Xiongqi (Wesley) Wu
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> > --
> > > > > > > >> > Xiongqi (Wesley) Wu
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Xiongqi (Wesley) Wu
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > >
> > >
> > >
> > > --
> > > -Regards,
> > > Mayuresh R. Gharat
> > > (862) 250-7125
> > >
> >
>

Re: [DISCUSS] KIP-354 Time-based log compaction policy

Posted by Dong Lin <li...@gmail.com>.
Hey Xiongqi,

Sorry for late reply. I have some comments below:

1) As discussed earlier in the email list, if the topic is configured with
both deletion and compaction, in some cases messages produced a long time
ago cannot be deleted based on time. This is a valid use-case because we
actually have topics which are configured with both deletion and compaction
policies, and we should enforce the semantics of both. Solution A sounds
good. We do not need an interface change (e.g. an extra config) to enforce
solution A. All we need is to update the implementation so that when the
broker compacts a topic, messages that are too old (based on the
time-based retention config) are discarded if they carry a timestamp
(which is the common case). Since this is a valid issue and it is also
related to the guarantee of when a message can be deleted, can we include
the solution of this problem in the KIP?
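To make solution A concrete, here is a minimal Python sketch of a
compaction pass that also enforces time-based retention. This is purely
illustrative, not Kafka's actual Scala cleaner code; the `Record` type and
`compact` function are hypothetical names for this example:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Record:
    key: str
    value: Optional[str]  # None would model a tombstone
    timestamp_ms: int

def compact(records, now_ms, retention_ms):
    """Keep only the latest record per key, then additionally discard any
    surviving record whose timestamp is older than the time-based
    retention threshold ("solution A" in the discussion)."""
    cutoff_ms = now_ms - retention_ms
    latest = {}
    for rec in records:  # later records override earlier ones per key
        latest[rec.key] = rec
    # Enforce time-based retention during compaction: even the latest
    # value for a key is dropped once it is older than the cutoff.
    return [rec for rec in latest.values() if rec.timestamp_ms >= cutoff_ms]
```

With this approach a key that is never overwritten still disappears once
its last value falls outside the retention window, which is exactly the
gap described above.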

2) It is probably OK to assume that all messages have a timestamp. The
per-message timestamp was introduced in Kafka 0.10.0 with KIP-31 and
KIP-32 as of Feb 2016, and Kafka 0.10.0 and earlier versions are no longer
supported. Also, since the use-case for this feature is primarily GDPR,
we can assume that the client library has already been upgraded to support
SSL, a feature added after KIP-31 and KIP-32.

3) In Proposed Change section 2.a, it is said that segment.largestTimestamp
- maxSegmentMs can be used to estimate the timestamp of the earliest
message. Would it be simpler to just use the creation time of the segment
file to determine this?

4) The KIP suggests using must-clean-ratio to select the partition to be
compacted. Unlike the dirty ratio, which is mostly about performance, logs
whose "must-clean-ratio" is non-zero must be compacted immediately for
correctness reasons (and for GDPR). And if this cannot be achieved because
e.g. broker compaction throughput is too low, investigation will be needed.
So it seems simpler to first compact logs which have a segment whose
earliest timestamp is earlier than now - max.compaction.lag.ms, instead of
defining must-clean-ratio and sorting logs based on this value.
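The simpler selection rule suggested here could look like the following
Python sketch (illustrative only; the function name and the shape of the
input map are assumptions for this example, not broker APIs):

```python
def logs_due_for_compaction(earliest_uncompacted_ts, now_ms,
                            max_compaction_lag_ms):
    """Pick the logs that must be compacted now: any log whose earliest
    un-compacted segment timestamp has fallen behind
    max.compaction.lag.ms, most-overdue first.
    `earliest_uncompacted_ts` maps a partition name to the estimated
    earliest timestamp (ms) of its first dirty segment."""
    threshold_ms = now_ms - max_compaction_lag_ms
    overdue = [(ts, name) for name, ts in earliest_uncompacted_ts.items()
               if ts <= threshold_ms]
    # Sorting by timestamp compacts the most overdue partitions first,
    # with no need for a separate must-clean-ratio metric.
    return [name for ts, name in sorted(overdue)]
```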

5) The KIP says max.compaction.lag.ms is 0 by default, and it is also
suggested that 0 means disabled. Should we set this value to MAX_LONG by
default to effectively disable the feature added in this KIP?

6) It is probably cleaner and more readable not to include in the Public
Interface section those configs whose meaning is not changed.

7) The goal of this KIP is to ensure that a log segment whose earliest
message is earlier than a given threshold will be compacted. This goal may
not be achieved if the compaction throughput cannot catch up with the total
bytes-in-rate for the compacted topics on the broker. Thus we need an easy
way to tell the operator whether this goal is achieved. If we don't already
have such metrics, maybe we can include metrics to show 1) the total number
of log segments (or logs) that need to be immediately compacted as
determined by max.compaction.lag; and 2) the maximum value of now -
earliest_time_stamp_of_segment among all segments that need to be
compacted.
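The two gauges proposed above could be computed roughly as follows. This
is a hedged Python sketch; the function name is hypothetical and Kafka
would expose these as JMX metrics rather than return values:

```python
def compaction_lag_gauges(segment_earliest_ts, now_ms, max_compaction_lag_ms):
    """Compute the two proposed gauges from a list of per-segment earliest
    timestamps (ms): the number of segments overdue for compaction, and
    the maximum lag (now minus earliest segment timestamp) among them.
    Returns (num_overdue, max_lag_ms)."""
    threshold_ms = now_ms - max_compaction_lag_ms
    overdue = [ts for ts in segment_earliest_ts if ts <= threshold_ms]
    # If nothing is overdue, report zero lag rather than raising on max().
    max_lag_ms = max((now_ms - ts for ts in overdue), default=0)
    return len(overdue), max_lag_ms
```

An operator alerting on a steadily growing max_lag_ms would know the
cleaner cannot keep up with the bytes-in-rate.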

8) The Performance Impact section suggests that users use the existing
metrics to monitor the performance impact of this KIP. It is useful to
list the meaning of each JMX metric that we want users to monitor, and
possibly explain how to interpret the values of these metrics to determine
whether there is a performance issue.

Thanks,
Dong

On Tue, Oct 16, 2018 at 10:53 AM xiongqi wu <xi...@gmail.com> wrote:

> Mayuresh,
>
> Thanks for the comments.
> The requirement is that we need to pick up segments that are older than
> maxCompactionLagMs for compaction.
> maxCompactionLagMs is an upper-bound, which implies that picking up
> segments for compaction earlier doesn't violate the policy.
> We use the creation time of a segment as an estimation of its records
> arrival time, so these records can be compacted no later than
> maxCompactionLagMs.
>
> On the other hand, compaction is an expensive operation, we don't want to
> compact the log partition whenever a new segment is sealed.
> Therefore, we want to pick up a segment for compaction when the segment is
> close to the mandatory max compaction lag (so we use segment creation
> time as an estimation.)
>
>
> Xiongqi (Wesley) Wu
>
>
> On Mon, Oct 15, 2018 at 5:54 PM Mayuresh Gharat <
> gharatmayuresh15@gmail.com>
> wrote:
>
> > Hi Wesley,
> >
> > Thanks for the KIP and sorry for being late to the party.
> >  I wanted to understand, the scenario you mentioned in Proposed changes :
> >
> > -
> > >
> > > Estimate the earliest message timestamp of an un-compacted log segment.
> > we
> > > only need to estimate earliest message timestamp for un-compacted log
> > > segments to ensure timely compaction because the deletion requests that
> > > belong to compacted segments have already been processed.
> > >
> > >    1.
> > >
> > >    for the first (earliest) log segment:  The estimated earliest
> > >    timestamp is set to the timestamp of the first message if timestamp
> is
> > >    present in the message. Otherwise, the estimated earliest timestamp
> > is set
> > >    to "segment.largestTimestamp - maxSegmentMs”
> > >     (segment.largestTimestamp is lastModified time of the log segment
> or
> > max
> > >    timestamp we see for the log segment.). In the latter case, the
> actual
> > >    timestamp of the first message might be later than the estimation,
> > but it
> > >    is safe to pick up the log for compaction earlier.
> > >
> > When we say "actual timestamp of the first message might be later than
> > the estimation, but it is safe to pick up the log for compaction
> > earlier.", doesn't that violate the assumption that we will consider a
> > segment for compaction only if the creation time of the segment has
> > crossed "now - maxCompactionLagMs"?
> >
> > Thanks,
> >
> > Mayuresh
> >
> > On Mon, Sep 3, 2018 at 7:28 PM Brett Rann <br...@zendesk.com.invalid>
> > wrote:
> >
> > > Might also be worth moving to a vote thread? Discussion seems to have
> > gone
> > > as far as it can.
> > >
> > > > On 4 Sep 2018, at 12:08, xiongqi wu <xi...@gmail.com> wrote:
> > > >
> > > > Brett,
> > > >
> > > > Yes, I will post PR tomorrow.
> > > >
> > > > Xiongqi (Wesley) Wu
> > > >
> > > >
> > > > On Sun, Sep 2, 2018 at 6:28 PM Brett Rann <brann@zendesk.com.invalid
> >
> > > wrote:
> > > >
> > > > > +1 (non-binding) from me on the interface. I'd like to see someone
> > > familiar
> > > > > with
> > > > > the code comment on the approach, and note there's a couple of
> > > different
> > > > > approaches: what's documented in the KIP, and what Xiaohe Dong was
> > > working
> > > > > on
> > > > > here:
> > > > >
> > > > >
> > >
> >
> https://github.com/dongxiaohe/kafka/tree/dongxiaohe/log-cleaner-compaction-max-lifetime-2.0
> > > > >
> > > > > If you have code working already Xiongqi Wu could you share a PR?
> I'd
> > > be
> > > > > happy
> > > > > to start testing.
> > > > >
> > > > > On Tue, Aug 28, 2018 at 5:57 AM xiongqi wu <xi...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > Do you have any additional comments on this KIP?
> > > > > >
> > > > > >
> > > > > > On Thu, Aug 16, 2018 at 9:17 PM, xiongqi wu <xiongqiwu@gmail.com
> >
> > > wrote:
> > > > > >
> > > > > > > on 2)
> > > > > > > The offsetmap is built starting from dirty segment.
> > > > > > > The compaction starts from the beginning of the log partition.
> > > That's
> > > > > how
> > > > > > > it ensure the deletion of tomb keys.
> > > > > > > I will double check tomorrow.
> > > > > > >
> > > > > > > Xiongqi (Wesley) Wu
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Aug 16, 2018 at 6:46 PM Brett Rann
> > > <br...@zendesk.com.invalid>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> To just clarify a bit on 1. whether there's an external
> > storage/DB
> > > > > isn't
> > > > > > >> relevant here.
> > > > > > >> Compacted topics allow a tombstone record to be sent (a null
> > value
> > > > > for a
> > > > > > >> key) which
> > > > > > >> currently will result in old values for that key being deleted
> > if
> > > some
> > > > > > >> conditions are met.
> > > > > > >> There are existing controls to make sure the old values will
> > stay
> > > > > around
> > > > > > >> for a minimum
> > > > > > >> time at least, but no dedicated control to ensure the
> tombstone
> > > will
> > > > > > >> delete
> > > > > > >> within a
> > > > > > >> maximum time.
> > > > > > >>
> > > > > > >> One popular reason that a maximum time for deletion is
> > > > > > >> desirable right now is GDPR with PII. But we're not proposing
> > > > > > >> any GDPR awareness in kafka, just being able to guarantee a max
> > > > > > >> time after which a tombstoned key will be removed from the
> > > > > > >> compacted topic.
> > > > > > >>
> > > > > > >> on 2)
> > > > > > >> huh, i thought it kept track of the first dirty segment and
> > > > > > >> didn't recompact older "clean" ones.
> > > > > > >> But I didn't look at code or test for that.
> > > > > > >>
> > > > > > >> On Fri, Aug 17, 2018 at 10:57 AM xiongqi wu <xiongqiwu@gmail.com> wrote:
> > > > > > >>
> > > > > > >> > 1) The owner of the data (in this sense, Kafka is not the
> > > > > > >> > owner of the data) should keep track of the lifecycle of the
> > > > > > >> > data in some external storage/DB. The owner determines when
> > > > > > >> > to delete the data and sends the delete request to Kafka.
> > > > > > >> > Kafka doesn't know about the content of the data but provides
> > > > > > >> > a means for deletion.
> > > > > > >> >
> > > > > > >> > 2) Each time compaction runs, it will start from the first
> > > > > > >> > segment (no matter whether it is compacted or not). The time
> > > > > > >> > estimation here is only used to determine whether we should
> > > > > > >> > run compaction on this log partition. So we only need to
> > > > > > >> > estimate uncompacted segments.
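[Editor's note] The eligibility check described in point 2) can be sketched as a single predicate: estimate the earliest message timestamp of the uncompacted (dirty) segments and pick the partition for compaction only when that estimate is older than the max compaction lag. This is an illustrative sketch; `needs_compaction` and its parameters are invented names, not Kafka internals.

```python
# Sketch of the per-partition eligibility check: compact only when the
# oldest dirty data has been sitting uncompacted longer than the max lag.

def needs_compaction(earliest_dirty_ts_ms, now_ms, max_compaction_lag_ms):
    if max_compaction_lag_ms <= 0:   # 0 / -1: feature disabled
        return False
    return now_ms - earliest_dirty_ts_ms > max_compaction_lag_ms

DAY = 24 * 60 * 60 * 1000
now = 100 * DAY
print(needs_compaction(now - 8 * DAY, now, 7 * DAY))  # dirty data 8 days old, lag 7 days
print(needs_compaction(now - 2 * DAY, now, 7 * DAY))  # dirty data only 2 days old
```

Only dirty segments feed the estimate, which matches the point above: once a segment has been compacted, any deletion requests it contained have already been processed.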
> > > > > > >> >
> > > > > > >> > On Thu, Aug 16, 2018 at 5:35 PM, Dong Lin <lindong28@gmail.com> wrote:
> > > > > > >> >
> > > > > > >> > > Hey Xiongqi,
> > > > > > >> > >
> > > > > > >> > > Thanks for the update. I have two questions for the latest KIP.
> > > > > > >> > >
> > > > > > >> > > 1) The motivation section says that one use case is to
> > > > > > >> > > delete PII (Personally Identifiable Information) data within
> > > > > > >> > > 7 days while keeping non-PII indefinitely in compacted
> > > > > > >> > > format. I suppose the use-case depends on the application to
> > > > > > >> > > determine when to delete those PII data. Could you explain
> > > > > > >> > > how the application can reliably determine the set of keys
> > > > > > >> > > that should be deleted? Is the application required to
> > > > > > >> > > always read messages from the topic after every restart and
> > > > > > >> > > determine the keys to be deleted by looking at message
> > > > > > >> > > timestamps, or is the application supposed to persist the
> > > > > > >> > > key->timestamp information in a separate persistent storage
> > > > > > >> > > system?
> > > > > > >> > >
> > > > > > >> > > 2) It is mentioned in the KIP that "we only need to
> > > > > > >> > > estimate earliest message timestamp for un-compacted log
> > > > > > >> > > segments because the deletion requests that belong to
> > > > > > >> > > compacted segments have already been processed". Not sure if
> > > > > > >> > > it is correct. If a segment is compacted before the user
> > > > > > >> > > sends a message to delete a key in this segment, it seems
> > > > > > >> > > that we still need to ensure that the segment will be
> > > > > > >> > > compacted again within the given time after the deletion is
> > > > > > >> > > requested, right?
> > > > > > >> > >
> > > > > > >> > > Thanks,
> > > > > > >> > > Dong
> > > > > > >> > >
> > > > > > >> > > On Thu, Aug 16, 2018 at 10:27 AM, xiongqi wu <xiongqiwu@gmail.com> wrote:
> > > > > > >> > >
> > > > > > >> > > > Hi Xiaohe,
> > > > > > >> > > >
> > > > > > >> > > > Quick note:
> > > > > > >> > > > 1) Use the minimum of segment.ms and max.compaction.lag.ms.
> > > > > > >> > > >
> > > > > > >> > > > 2) I am not sure I get your second question. First, we
> > > > > > >> > > > have jitter when we roll the active segment. Second, on
> > > > > > >> > > > each compaction, we compact up to what the offset map
> > > > > > >> > > > allows. Those will not lead to a perfect compaction storm
> > > > > > >> > > > over time. In addition, I expect we are setting
> > > > > > >> > > > max.compaction.lag.ms on the order of days.
> > > > > > >> > > >
> > > > > > >> > > > 3) I don't have access to the confluent community slack
> > > > > > >> > > > for now. I am reachable via Google Hangouts.
> > > > > > >> > > > To avoid double effort, here is my plan:
> > > > > > >> > > > a) Collect more feedback and feature requirements on the KIP.
> > > > > > >> > > > b) Wait until this KIP is approved.
> > > > > > >> > > > c) I will address any additional requirements in the
> > > > > > >> > > > implementation. (My current implementation only complies
> > > > > > >> > > > with whatever is described in the KIP now.)
> > > > > > >> > > > d) I can share the code with you and the community if you
> > > > > > >> > > > want to add anything.
> > > > > > >> > > > e) Submission through committee.
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> > > > On Wed, Aug 15, 2018 at 11:42 PM, XIAOHE DONG <dannyrivclo@gmail.com> wrote:
> > > > > > >> > > >
> > > > > > >> > > > > Hi Xiongqi
> > > > > > >> > > > >
> > > > > > >> > > > > Thanks for thinking about implementing this as well. :)
> > > > > > >> > > > >
> > > > > > >> > > > > I was thinking about using `segment.ms` to trigger the
> > > > > > >> > > > > segment roll. Also, its value can be the largest time
> > > > > > >> > > > > bias for record deletion. For example, if `segment.ms`
> > > > > > >> > > > > is 1 day and `max.compaction.lag.ms` is 30 days, the
> > > > > > >> > > > > compaction may happen around 31 days.
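[Editor's note] The 31-day figure above comes from simple worst-case arithmetic: a record written right after a segment rolls can sit in the active segment for up to `segment.ms`, and only once the segment closes does the max compaction lag clock bound the remaining delay. A quick check of that bound (illustrative only):

```python
# Worst-case end-to-end delay before a record is compacted, under the
# example values above: segment.ms = 1 day, max.compaction.lag.ms = 30 days.

DAY = 24 * 60 * 60 * 1000  # milliseconds

segment_ms = 1 * DAY
max_compaction_lag_ms = 30 * DAY

# Up to segment.ms in the (uncompactable) active segment, then up to
# max.compaction.lag.ms before the cleaner must pick the partition up.
worst_case_ms = segment_ms + max_compaction_lag_ms
print(worst_case_ms // DAY)  # 31 (days), matching the example above
```

Forcing the active segment to roll based on the max lag (as proposed elsewhere in this thread) tightens this bound, since the segment can no longer wait out the full `segment.ms` first.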
> > > > > > >> > > > >
> > > > > > >> > > > > For my curiosity, is there a way we can do some
> > > > > > >> > > > > performance testing for this, and are there any tools
> > > > > > >> > > > > you can recommend? As you know, previously it is cleaned
> > > > > > >> > > > > up by respecting the dirty ratio, but now it may happen
> > > > > > >> > > > > anytime once the max lag has passed for each message. I
> > > > > > >> > > > > wonder what would happen if clients send a huge amount
> > > > > > >> > > > > of tombstone records at the same time.
> > > > > > >> > > > >
> > > > > > >> > > > > I am looking forward to a quick chat with you to avoid
> > > > > > >> > > > > double effort on this. I am in the confluent community
> > > > > > >> > > > > slack during work time. My name is Xiaohe Dong. :)
> > > > > > >> > > > >
> > > > > > >> > > > > Rgds
> > > > > > >> > > > > Xiaohe Dong
> > > > > > >> > > > >
> > > > > > >> > > > >
> > > > > > >> > > > >
> > > > > > >> > > > > On 2018/08/16 01:22:22, xiongqi wu <xiongqiwu@gmail.com> wrote:
> > > > > > >> > > > > > Brett,
> > > > > > >> > > > > >
> > > > > > >> > > > > > Thank you for your comments.
> > > > > > >> > > > > > I was thinking, since we already have an immediate
> > > > > > >> > > > > > compaction setting by setting the min dirty ratio to
> > > > > > >> > > > > > 0, I decided to use "0" as the disabled state.
> > > > > > >> > > > > > I am ok to go with the -1 (disable), 0 (immediate)
> > > > > > >> > > > > > options.
> > > > > > >> > > > > >
> > > > > > >> > > > > > For the implementation, there are a few differences
> > > > > > >> > > > > > between mine and Xiaohe Dong's:
> > > > > > >> > > > > > 1) I used the estimated creation time of a log
> > > > > > >> > > > > > segment instead of the largest timestamp of a log to
> > > > > > >> > > > > > determine compaction eligibility, because a log
> > > > > > >> > > > > > segment might stay as the active segment for up to
> > > > > > >> > > > > > "max compaction lag" (see the KIP for detail).
> > > > > > >> > > > > > 2) I measure how many bytes we must clean to follow
> > > > > > >> > > > > > the "max compaction lag" rule, and use that to
> > > > > > >> > > > > > determine the order of compaction.
> > > > > > >> > > > > > 3) Force the active segment to roll to follow the
> > > > > > >> > > > > > "max compaction lag".
> > > > > > >> > > > > >
> > > > > > >> > > > > > I can share my code so we can coordinate.
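[Editor's note] Difference 3) above, forcing the active segment to roll, can be sketched as a small predicate: roll once the segment's estimated creation time is older than the max compaction lag, so a low-traffic partition cannot keep dirty records hidden in the active segment indefinitely. All names here are illustrative, not the actual implementation.

```python
# Sketch: decide whether the active segment must be force-rolled so the
# cleaner can reach its records within max.compaction.lag.ms.

def should_force_roll(segment_create_ts_ms, now_ms, max_compaction_lag_ms):
    # A non-positive lag means the feature is disabled.
    return max_compaction_lag_ms > 0 and \
        now_ms - segment_create_ts_ms > max_compaction_lag_ms

HOUR = 60 * 60 * 1000
print(should_force_roll(0, 50 * HOUR, 48 * HOUR))  # segment older than lag
print(should_force_roll(0, 10 * HOUR, 48 * HOUR))  # still within lag
```

Using the segment's estimated creation time (rather than its largest record timestamp) is what makes this work for the active segment, which is exactly the rationale given in point 1) above.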
> > > > > > >> > > > > >
> > > > > > >> > > > > > I haven't thought about a new API to force a
> > > > > > >> > > > > > compaction. What is the use case for this one?
> > > > > > >> > > > > >
> > > > > > >> > > > > >
> > > > > > >> > > > > > On Wed, Aug 15, 2018 at 5:33 PM, Brett Rann <brann@zendesk.com.invalid> wrote:
> > > > > > >> > > > > >
> > > > > > >> > > > > > > We've been looking into this too.
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > Mailing list:
> > > > > > >> > > > > > > https://lists.apache.org/thread.html/ed7f6a6589f94e8c2a705553f364ef599cb6915e4c3ba9b561e610e4@%3Cdev.kafka.apache.org%3E
> > > > > > >> > > > > > > jira wish: https://issues.apache.org/jira/browse/KAFKA-7137
> > > > > > >> > > > > > > confluent slack discussion:
> > > > > > >> > > > > > > https://confluentcommunity.slack.com/archives/C49R61XMM/p1530760121000039
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > A person on my team has started on code so you
> > > > > > >> > > > > > > might want to coordinate:
> > > > > > >> > > > > > > https://github.com/dongxiaohe/kafka/tree/dongxiaohe/log-cleaner-compaction-max-lifetime-2.0
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > He's been working with Jason Gustafson and James
> > > > > > >> > > > > > > Chen around the changes. You can ping him on
> > > > > > >> > > > > > > confluent slack as Xiaohe Dong.
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > It's great to know others are thinking on it as well.
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > You've added the requirement to force a segment
> > > > > > >> > > > > > > roll, which we hadn't gotten to yet, which is great.
> > > > > > >> > > > > > > I was content with it not including the active
> > > > > > >> > > > > > > segment.
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > > Adding topic level configuration
> > > > > > >> > > > > > > > "max.compaction.lag.ms", and corresponding broker
> > > > > > >> > > > > > > > configuration "log.cleaner.max.compaction.lag.ms",
> > > > > > >> > > > > > > > which is set to 0 (disabled) by default.
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > Glancing at some other settings, the convention
> > > > > > >> > > > > > > seems to me to be -1 for disabled (or infinite,
> > > > > > >> > > > > > > which is more meaningful here). 0 to me implies
> > > > > > >> > > > > > > instant, a little quicker than 1.
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > We've been trying to think about a way to trigger
> > > > > > >> > > > > > > compaction as well through an API call, which would
> > > > > > >> > > > > > > need to be flagged somewhere (ZK admin/ space?) but
> > > > > > >> > > > > > > we're struggling to think how that would be
> > > > > > >> > > > > > > coordinated across brokers and partitions. Have you
> > > > > > >> > > > > > > given any thought to that?
> > > > > > >> > > > > > >
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > On Thu, Aug 16, 2018 at 8:44 AM xiongqi wu <xiongqiwu@gmail.com> wrote:
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > > Eno, Dong,
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > I have updated the KIP. We decided not to address
> > > > > > >> > > > > > > > the issue that we might have for topics with both
> > > > > > >> > > > > > > > compaction and time retention enabled (see the
> > > > > > >> > > > > > > > rejected alternative item 2). This KIP will only
> > > > > > >> > > > > > > > ensure the log can be compacted after a specified
> > > > > > >> > > > > > > > time interval.
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > As suggested by Dong, we will also enforce that
> > > > > > >> > > > > > > > "max.compaction.lag.ms" is not less than
> > > > > > >> > > > > > > > "min.compaction.lag.ms".
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-354%3A+Time-based+log+compaction+policy
> > > > > > >> > > > > > > > (Time-based log compaction policy)
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > On Tue, Aug 14, 2018 at 5:01 PM, xiongqi wu <xiongqiwu@gmail.com> wrote:
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > Per discussion with Dong, he made a very good
> > > > > > >> > > > > > > > > point that if compaction and time based
> > > > > > >> > > > > > > > > retention are both enabled on a topic, the
> > > > > > >> > > > > > > > > compaction might prevent records from being
> > > > > > >> > > > > > > > > deleted on time. The reason is that when
> > > > > > >> > > > > > > > > compacting multiple segments into one single
> > > > > > >> > > > > > > > > segment, the newly created segment will have the
> > > > > > >> > > > > > > > > same lastModified timestamp as the latest
> > > > > > >> > > > > > > > > original segment. We lose the timestamps of all
> > > > > > >> > > > > > > > > original segments except the last one. As a
> > > > > > >> > > > > > > > > result, records might not be deleted as they
> > > > > > >> > > > > > > > > should be through time based retention.
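[Editor's note] The timestamp-loss problem described above can be demonstrated in a few lines. This is an illustrative model (all names invented), showing how merging segments makes old data look young to the retention check:

```python
# Sketch: a compacted (merged) segment inherits the lastModified time of
# its newest input, so time-based retention no longer sees the old data.

def merged_last_modified(segment_mtimes_ms):
    # Compaction output keeps only the newest input segment's mtime.
    return max(segment_mtimes_ms)

DAY = 24 * 60 * 60 * 1000
now = 100 * DAY
mtimes = [now - 40 * DAY, now - 35 * DAY, now - 1 * DAY]  # three inputs
retention_ms = 30 * DAY

# Before merging, two of the three segments would be eligible for deletion:
print(sum(1 for m in mtimes if now - m > retention_ms))   # 2

# After merging, the single output segment looks only 1 day old, so the
# 40- and 35-day-old records are never deleted by retention:
print(now - merged_last_modified(mtimes) > retention_ms)  # False
```

This is why options A (check record timestamps during compaction) and B (retain per-segment lastModified metadata) are proposed below in the thread.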
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > With the current KIP proposal, if we want to
> > > > > > >> > > > > > > > > ensure timely deletion, we have the following
> > > > > > >> > > > > > > > > configurations:
> > > > > > >> > > > > > > > > 1) enable time based log compaction only:
> > > > > > >> > > > > > > > > deletion is done through overriding the same key
> > > > > > >> > > > > > > > > 2) enable time based log retention only:
> > > > > > >> > > > > > > > > deletion is done through time-based retention
> > > > > > >> > > > > > > > > 3) enable both log compaction and time based
> > > > > > >> > > > > > > > > retention: deletion is not guaranteed
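[Editor's note] For reference, the three setups enumerated above map roughly to the following topic-level settings. The `cleanup.policy`, `retention.ms`, `min.compaction.lag.ms`, and `delete.retention.ms` names are real Kafka topic configs; `max.compaction.lag.ms` is the configuration this KIP proposes, and the values shown are illustrative:

```properties
# 1) compaction only: deletion happens by overriding/tombstoning the key
cleanup.policy=compact
min.compaction.lag.ms=0
# proposed by this KIP: bound on how long a record can stay uncompacted
max.compaction.lag.ms=604800000

# 2) time-based retention only: whole segments deleted after the period
cleanup.policy=delete
retention.ms=604800000

# 3) both enabled: compaction may merge segments and lose per-segment
#    lastModified times, so timely deletion is not guaranteed (see above)
cleanup.policy=compact,delete
```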
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > Not sure if we have use case 3 and also want
> > > > > > >> > > > > > > > > deletion to happen on time. There are several
> > > > > > >> > > > > > > > > options to address the deletion issue when both
> > > > > > >> > > > > > > > > compaction and retention are enabled:
> > > > > > >> > > > > > > > > A) During log compaction, look into the record
> > > > > > >> > > > > > > > > timestamp to delete expired records. This can be
> > > > > > >> > > > > > > > > done in the compaction logic itself or by using
> > > > > > >> > > > > > > > > AdminClient.deleteRecords(). But this assumes we
> > > > > > >> > > > > > > > > have record timestamps.
> > > > > > >> > > > > > > > > B) Retain the lastModified time of the original
> > > > > > >> > > > > > > > > segments during log compaction. This requires
> > > > > > >> > > > > > > > > extra metadata to record the information, or not
> > > > > > >> > > > > > > > > grouping multiple segments into one during
> > > > > > >> > > > > > > > > compaction.
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > If we have use case 3 in general, I would
> > > > > > >> > > > > > > > > prefer solution A and rely on record timestamps.
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > Two questions:
> > > > > > >> > > > > > > > > Do we have use case 3? Is it nice to have or
> > > > > > >> > > > > > > > > must have?
> > > > > > >> > > > > > > > > If we have use case 3 and want to go with
> > > > > > >> > > > > > > > > solution A, should we introduce a new
> > > > > > >> > > > > > > > > configuration to enforce deletion by timestamp?
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > On Tue, Aug 14, 2018 at 1:52 PM, xiongqi wu <xiongqiwu@gmail.com> wrote:
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > >> Dong,
> > > > > > >> > > > > > > > >>
> > > > > > >> > > > > > > > >> Thanks for the comment.
> > > > > > >> > > > > > > > >>
> > > > > > >> > > > > > > > >> There are two retention policies: log
> > > > > > >> > > > > > > > >> compaction and time based retention.
> > > > > > >> > > > > > > > >>
> > > > > > >> > > > > > > > >> Log compaction:
> > > > > > >> > > > > > > > >>
> > > > > > >> > > > > > > > >> We have use cases that keep infinite retention
> > > > > > >> > > > > > > > >> of a topic (compaction only). GDPR cares about
> > > > > > >> > > > > > > > >> deletion of PII (personally identifiable
> > > > > > >> > > > > > > > >> information) data.
> > > > > > >> > > > > > > > >> Since Kafka doesn't know which records contain
> > > > > > >> > > > > > > > >> PII, it relies on the upper layer to delete
> > > > > > >> > > > > > > > >> those records.
> > > > > > >> > > > > > > > >> For those infinite retention use cases, kafka
> > > > > > >> > > > > > > > >> needs to provide a way to enforce compaction on
> > > > > > >> > > > > > > > >> time. This is what we try to address in this
> > > > > > >> > > > > > > > >> KIP.
> > > > > > >> > > > > > > > >>
> > > > > > >> > > > > > > > >> Time based retention:
> > > > > > >> > > > > > > > >>
> > > > > > >> > > > > > > > >> There are also use cases in which users of
> > > > > > >> > > > > > > > >> Kafka might want to expire all their data. In
> > > > > > >> > > > > > > > >> those cases, they can use time based retention
> > > > > > >> > > > > > > > >> on their topics.
> > > > > > >> > > > > > > > >>
> > > > > > >> > > > > > > > >>
> > > > > > >> > > > > > > > >> Regarding your first question, if a user wants
> > > > > > >> > > > > > > > >> to delete a key in a log compaction topic, the
> > > > > > >> > > > > > > > >> user has to send a deletion using the same key.
> > > > > > >> > > > > > > > >> Kafka only makes sure the deletion will happen
> > > > > > >> > > > > > > > >> within a certain time period (like 2 days/7
> > > > > > >> > > > > > > > >> days).
> > > > > > >> > > > > > > > >>
> > > > > > >> > > > > > > > >> Regarding your second question: in most cases,
> > > > > > >> > > > > > > > >> we might want to delete all duplicated keys at
> > > > > > >> > > > > > > > >> the same time. Compaction might be more
> > > > > > >> > > > > > > > >> efficient since we need to scan the log and
> > > > > > >> > > > > > > > >> find all duplicates. However, the expected use
> > > > > > >> > > > > > > > >> case is to set the time based compaction
> > > > > > >> > > > > > > > >> interval on the order of days, and larger than
> > > > > > >> > > > > > > > >> the "min compaction lag". We don't want log
> > > > > > >> > > > > > > > >> compaction to happen frequently since it is
> > > > > > >> > > > > > > > >> expensive. The purpose is to help low
> > > > > > >> > > > > > > > >> production rate topics get compacted on time.
> > > > > > >> > > > > > > > >> For topics with a "normal" incoming message
> > > > > > >> > > > > > > > >> rate, the "min dirty ratio" might have
> > > > > > >> > > > > > > > >> triggered the compaction before this time based
> > > > > > >> > > > > > > > >> compaction policy takes effect.
> > > > > > >> > > > > > > > >>
> > > > > > >> > > > > > > > >>
> > > > > > >> > > > > > > > >> Eno,
> > > > > > >> > > > > > > > >>
> > > > > > >> > > > > > > > >> For your question, as I mentioned, we have
> > > > > > >> > > > > > > > >> long time retention use cases for log compacted
> > > > > > >> > > > > > > > >> topics, but we want to provide the ability to
> > > > > > >> > > > > > > > >> delete certain PII records on time.
> > > > > > >> > > > > > > > >> Kafka itself doesn't know whether a record
> > > > > > >> > > > > > > > >> contains sensitive information and relies on
> > > > > > >> > > > > > > > >> the user for deletion.
> > > > > > >> > > > > > > > >>
> > > > > > >> > > > > > > > >>
> > > > > > >> > > > > > > > >> On Mon, Aug 13, 2018 at 6:58 PM, Dong Lin <lindong28@gmail.com> wrote:
> > > > > > >> > > > > > > > >>
> > > > > > >> > > > > > > > >>> Hey Xiongqi,
> > > > > > >> > > > > > > > >>>
> > > > > > >> > > > > > > > >>> Thanks for the KIP. I have two questions
> > > > > > >> > > > > > > > >>> regarding the use-case for meeting the GDPR
> > > > > > >> > > > > > > > >>> requirement.
> > > > > > >> > > > > > > > >>>
> > > > > > >> > > > > > > > >>> 1) If I recall correctly, one of the GDPR
> > > > > > >> > > > > > > > >>> requirements is that we can not keep messages
> > > > > > >> > > > > > > > >>> longer than e.g. 30 days in storage (e.g.
> > > > > > >> > > > > > > > >>> Kafka). Say there exists a partition p0 which
> > > > > > >> > > > > > > > >>> contains message1 with key1 and message2 with
> > > > > > >> > > > > > > > >>> key2. And then the user keeps producing
> > > > > > >> > > > > > > > >>> messages with key=key2 to this partition.
> > > > > > >> > > > > > > > >>> Since message1 with key1 is never overridden,
> > > > > > >> > > > > > > > >>> sooner or later we will want to delete
> > > > > > >> > > > > > > > >>> message1 and keep the latest message with
> > > > > > >> > > > > > > > >>> key=key2. But currently it looks like the log
> > > > > > >> > > > > > > > >>> compaction logic in Kafka will always put
> > > > > > >> > > > > > > > >>> these messages in the same segment. Will this
> > > > > > >> > > > > > > > >>> be an issue?
> > > > > > >> > > > > > > > >>>
> > > > > > >> > > > > > > > >>> 2) The current KIP intends to provide the
> > > > > > >> > > > > > > > >>> capability to delete a given message in a log
> > > > > > >> > > > > > > > >>> compacted topic. Does such a use-case also
> > > > > > >> > > > > > > > >>> require Kafka to keep the messages produced
> > > > > > >> > > > > > > > >>> before the given message? If yes, then we can
> > > > > > >> > > > > > > > >>> probably just use AdminClient.deleteRecords()
> > > > > > >> > > > > > > > >>> or time-based log retention to meet the
> > > > > > >> > > > > > > > >>> use-case requirement. If no, do you know what
> > > > > > >> > > > > > > > >>> is the GDPR's requirement on time-to-deletion
> > > > > > >> > > > > > > > >>> after the user explicitly requests the
> > > > > > >> > > > > > > > >>> deletion (e.g. 1 hour, 1 day, 7 days)?
> > > > > > >> > > > > > > > >>>
> > > > > > >> > > > > > > > >>> Thanks,
> > > > > > >> > > > > > > > >>> Dong
> > > > > > >> > > > > > > > >>>
> > > > > > >> > > > > > > > >>>
> > > > > > >> > > > > > > > >>> On Mon, Aug 13, 2018 at 3:44 PM, xiongqi wu <xiongqiwu@gmail.com> wrote:
> > > > > > >> > > > > > > > >>>
> > > > > > >> > > > > > > > >>> > Hi Eno,
> > > > > > >> > > > > > > > >>> >
> > > > > > >> > > > > > > > >>> > The GDPR request we are getting here at
> > > > > > >> > > > > > > > >>> > linkedin is: if we get a request to delete
> > > > > > >> > > > > > > > >>> > a record through a tombstone (null value)
> > > > > > >> > > > > > > > >>> > on a log compacted topic, we want to delete
> > > > > > >> > > > > > > > >>> > the record via compaction in a given time
> > > > > > >> > > > > > > > >>> > period, like 2 days (whatever is required
> > > > > > >> > > > > > > > >>> > by the policy).
> > > > > > >> > > > > > > > >>> >
> > > > > > >> > > > > > > > >>> > There might be other issues (such as orphan
> > > > > > >> > > > > > > > >>> > log segments under certain conditions) that
> > > > > > >> > > > > > > > >>> > lead to GDPR problems, but they are more
> > > > > > >> > > > > > > > >>> > like something we need to fix anyway
> > > > > > >> > > > > > > > >>> > regardless of GDPR.
> > > > > > >> > > > > > > > >>> >
> > > > > > >> > > > > > > > >>> >
> > > > > > >> > > > > > > > >>> > -- Xiongqi (Wesley) Wu
> > > > > > >> > > > > > > > >>> >
> > > > > > >> > > > > > > > >>> > On Mon, Aug 13, 2018 at 2:56 PM, Eno Thereska <eno.thereska@gmail.com> wrote:
> > > > > > >> > > > > > > > >>> >
> > > > > > >> > > > > > > > >>> > > Hello,
> > > > > > >> > > > > > > > >>> > >
> > > > > > >> > > > > > > > >>> > > Thanks for the KIP. I'd like to see a
> > > > > > >> > > > > > > > >>> > > more precise definition of what part of
> > > > > > >> > > > > > > > >>> > > GDPR you are targeting, as well as some
> > > > > > >> > > > > > > > >>> > > sort of verification that this KIP
> > > > > > >> > > > > > > > >>> > > actually addresses the problem. Right now
> > > > > > >> > > > > > > > >>> > > I find this a bit vague:
> > > > > > >> > > > > > > > >>> > >
> > > > > > >> > > > > > > > >>> > > "Ability to delete a log message through
> > > > > > compaction
> > > > > > >> in
> > > > > > >> > a
> > > > > > >> > > > > timely
> > > > > > >> > > > > > > > >>> manner
> > > > > > >> > > > > > > > >>> > has
> > > > > > >> > > > > > > > >>> > > become an important requirement in some
> > use
> > > > > cases
> > > > > > >> > (e.g.,
> > > > > > >> > > > > GDPR)"
> > > > > > >> > > > > > > > >>> > >
> > > > > > >> > > > > > > > >>> > >
> > > > > > >> > > > > > > > >>> > > Is there any guarantee that after this
> > > > > > >> > > > > > > > >>> > > KIP the GDPR problem is solved, or do we
> > > > > > >> > > > > > > > >>> > > need to do something else as well, e.g.,
> > > > > > >> > > > > > > > >>> > > more KIPs?
> > > > > > >> > > > > > > > >>> > >
> > > > > > >> > > > > > > > >>> > >
> > > > > > >> > > > > > > > >>> > > Thanks
> > > > > > >> > > > > > > > >>> > >
> > > > > > >> > > > > > > > >>> > > Eno
> > > > > > >> > > > > > > > >>> > >
> > > > > > >> > > > > > > > >>> > >
> > > > > > >> > > > > > > > >>> > >
> > > > > > >> > > > > > > > >>> > > On Thu, Aug 9, 2018 at 4:18 PM, xiongqi wu <xiongqiwu@gmail.com> wrote:
> > > > > > >> > > > > > > > >>> > >
> > > > > > >> > > > > > > > >>> > > > Hi Kafka,
> > > > > > >> > > > > > > > >>> > > >
> > > > > > >> > > > > > > > >>> > > > This KIP tries to address GDPR concern
> > to
> > > > > > fulfill
> > > > > > >> > > > deletion
> > > > > > >> > > > > > > > request
> > > > > > >> > > > > > > > >>> on
> > > > > > >> > > > > > > > >>> > > time
> > > > > > >> > > > > > > > >>> > > > through time-based log compaction on a
> > > > > > compaction
> > > > > > >> > > enabled
> > > > > > >> > > > > > > topic:
> > > > > > >> > > > > > > > >>> > > >
> > > > > > >> > > > > > > > >>> > > >
> > > > > > >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > > > > >> > > > > > > > >>> > > >
> 354%3A+Time-based+log+compaction+policy
> > > > > > >> > > > > > > > >>> > > >
> > > > > > >> > > > > > > > >>> > > > Any feedback will be appreciated.
> > > > > > >> > > > > > > > >>> > > >
> > > > > > >> > > > > > > > >>> > > >
> > > > > > >> > > > > > > > >>> > > > Xiongqi (Wesley) Wu
> > > > > > >> > > > > > > > >>> > > >
> > > > > > >> > > > > > > > >>> > >
> > > > > > >> > > > > > > > >>> >
> > > > > > >> > > > > > > > >>>
> > > > > > >> > > > > > > > >>
> > > > > > >> > > > > > > > >>
> > > > > > >> > > > > > > > >>
> > > > > > >> > > > > > > > >> --
> > > > > > >> > > > > > > > >> Xiongqi (Wesley) Wu
> > > > > > >> > > > > > > > >>
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > --
> > > > > > >> > > > > > > > > Xiongqi (Wesley) Wu
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > --
> > > > > > >> > > > > > > > Xiongqi (Wesley) Wu
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > >
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > --
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > Brett Rann
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > Senior DevOps Engineer
> > > > > > >> > > > > > >
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > Zendesk International Ltd
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > 395 Collins Street, Melbourne VIC 3000 Australia
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > Mobile: +61 (0) 418 826 017
> > > > > > >> > > > > > >
> > > > > > >> > > > > >
> > > > > > >> > > > > >
> > > > > > >> > > > > >
> > > > > > >> > > > > > --
> > > > > > >> > > > > > Xiongqi (Wesley) Wu
> > > > > > >> > > > > >
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> > > > --
> > > > > > >> > > > Xiongqi (Wesley) Wu
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > --
> > > > > > >> > Xiongqi (Wesley) Wu
> > > > > > >> >
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Xiongqi (Wesley) Wu
> > > > > >
> > > > >
> > > > >
> > > > >
> > >
> >
> >
> > --
> > -Regards,
> > Mayuresh R. Gharat
> > (862) 250-7125
> >
>

Re: [DISCUSS] KIP-354 Time-based log compaction policy

Posted by xiongqi wu <xi...@gmail.com>.
Mayuresh,

Thanks for the comments.
The requirement is that we need to pick up segments that are older than
maxCompactionLagMs for compaction.
maxCompactionLagMs is an upper bound, which implies that picking up
segments for compaction earlier doesn't violate the policy.
We use the creation time of a segment as an estimate of its records'
arrival time, so these records can be compacted no later than
maxCompactionLagMs.

On the other hand, compaction is an expensive operation, so we don't want
to compact a log partition every time a new segment is sealed.
Therefore, we pick up a segment for compaction when the segment is
close to the mandatory max compaction lag (using the segment creation
time as an estimate).
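To make the selection rule above concrete, here is a minimal sketch in Python (all names here are hypothetical; Kafka's actual cleaner is implemented in Scala, and this only models the logic described in the KIP discussion):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Segment:
    # Hypothetical stand-in for a log segment's timestamp metadata.
    first_timestamp_ms: Optional[int]  # timestamp of the first record, if known
    largest_timestamp_ms: int          # max timestamp seen (or lastModified)

def estimate_earliest_timestamp_ms(seg: Segment, max_segment_ms: int) -> int:
    """Prefer the first record's timestamp; otherwise fall back to
    largestTimestamp - maxSegmentMs. The fallback may under-estimate,
    which only causes compaction to run earlier -- safe per the policy."""
    if seg.first_timestamp_ms is not None:
        return seg.first_timestamp_ms
    return seg.largest_timestamp_ms - max_segment_ms

def must_compact(seg: Segment, max_compaction_lag_ms: int,
                 max_segment_ms: int, now_ms: int) -> bool:
    """A segment becomes mandatory to compact once its estimated
    earliest record is older than max.compaction.lag.ms."""
    earliest = estimate_earliest_timestamp_ms(seg, max_segment_ms)
    return now_ms - earliest >= max_compaction_lag_ms
```

For example, with a 7-day max compaction lag, a segment whose first record arrived 8 days ago would be forced into compaction, while a freshly rolled segment would not.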


Xiongqi (Wesley) Wu


On Mon, Oct 15, 2018 at 5:54 PM Mayuresh Gharat <gh...@gmail.com>
wrote:

> Hi Wesley,
>
> Thanks for the KIP and sorry for being late to the party.
>  I wanted to understand the scenario you mentioned in Proposed changes:
>
> -
> >
> > Estimate the earliest message timestamp of an un-compacted log segment.
> We
> > only need to estimate earliest message timestamp for un-compacted log
> > segments to ensure timely compaction because the deletion requests that
> > belong to compacted segments have already been processed.
> >
> >    1.
> >
> >    for the first (earliest) log segment:  The estimated earliest
> >    timestamp is set to the timestamp of the first message if timestamp is
> >    present in the message. Otherwise, the estimated earliest timestamp
> is set
> >    to "segment.largestTimestamp - maxSegmentMs”
> >     (segment.largestTimestamp is lastModified time of the log segment or
> max
> >    timestamp we see for the log segment.). In the latter case, the actual
> >    timestamp of the first message might be later than the estimation,
> but it
> >    is safe to pick up the log for compaction earlier.
> >
> > When we say "actual timestamp of the first message might be later than
> the
> estimation, but it is safe to pick up the log for compaction earlier.",
> doesn't that violate the assumption that we will consider a segment for
> compaction only if the time of creation of the segment has crossed the "now -
> maxCompactionLagMs" ?
>
> Thanks,
>
> Mayuresh
>
> On Mon, Sep 3, 2018 at 7:28 PM Brett Rann <br...@zendesk.com.invalid>
> wrote:
>
> > Might also be worth moving to a vote thread? Discussion seems to have
> gone
> > as far as it can.
> >
> > > On 4 Sep 2018, at 12:08, xiongqi wu <xi...@gmail.com> wrote:
> > >
> > > Brett,
> > >
> > > Yes, I will post PR tomorrow.
> > >
> > > Xiongqi (Wesley) Wu
> > >
> > >
> > > On Sun, Sep 2, 2018 at 6:28 PM Brett Rann <br...@zendesk.com.invalid>
> > wrote:
> > >
> > > > +1 (non-binding) from me on the interface. I'd like to see someone
> > familiar
> > > > with
> > > > the code comment on the approach, and note there's a couple of
> > different
> > > > approaches: what's documented in the KIP, and what Xiaohe Dong was
> > working
> > > > on
> > > > here:
> > > >
> > > >
> >
> https://github.com/dongxiaohe/kafka/tree/dongxiaohe/log-cleaner-compaction-max-lifetime-2.0
> > > >
> > > > If you have code working already Xiongqi Wu could you share a PR? I'd
> > be
> > > > happy
> > > > to start testing.
> > > >
> > > > On Tue, Aug 28, 2018 at 5:57 AM xiongqi wu <xi...@gmail.com>
> > wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > Do you have any additional comments on this KIP?
> > > > >
> > > > >
> > > > > On Thu, Aug 16, 2018 at 9:17 PM, xiongqi wu <xi...@gmail.com>
> > wrote:
> > > > >
> > > > > > on 2)
> > > > > > The offsetmap is built starting from dirty segment.
> > > > > > The compaction starts from the beginning of the log partition.
> > That's
> > > > how
> > > > > > it ensures the deletion of tombstone keys.
> > > > > > I will double check tomorrow.
> > > > > >
> > > > > > Xiongqi (Wesley) Wu
> > > > > >
> > > > > >
> > > > > > On Thu, Aug 16, 2018 at 6:46 PM Brett Rann
> > <br...@zendesk.com.invalid>
> > > > > > wrote:
> > > > > >
> > > > > >> To just clarify a bit on 1. whether there's an external
> storage/DB
> > > > isn't
> > > > > >> relevant here.
> > > > > >> Compacted topics allow a tombstone record to be sent (a null
> value
> > > > for a
> > > > > >> key) which
> > > > > >> currently will result in old values for that key being deleted
> if
> > some
> > > > > >> conditions are met.
> > > > > >> There are existing controls to make sure the old values will
> stay
> > > > around
> > > > > >> for a minimum
> > > > > >> time at least, but no dedicated control to ensure the tombstone
> > will
> > > > > >> delete
> > > > > >> within a
> > > > > >> maximum time.
> > > > > >>
> > > > > >> One popular reason that maximum time for deletion is desirable
> > right
> > > > now
> > > > > >> is
> > > > > >> GDPR with
> > > > > >> PII. But we're not proposing any GDPR awareness in kafka, just
> > being
> > > > > able
> > > > > >> to guarantee
> > > > > >> a max time where a tombstoned key will be removed from the
> > compacted
> > > > > >> topic.
> > > > > >>
> > > > > >> on 2)
> > > > > >> huh, i thought it kept track of the first dirty segment and
> didn't
> > > > > >> recompact older "clean" ones.
> > > > > >> But I didn't look at code or test for that.
> > > > > >>
> > > > > >> On Fri, Aug 17, 2018 at 10:57 AM xiongqi wu <
> xiongqiwu@gmail.com>
> > > > > wrote:
> > > > > >>
> > > > > >> > 1, Owner of data (in this sense, kafka is the not the owner of
> > data)
> > > > > >> > should keep track of lifecycle of the data in some external
> > > > > storage/DB.
> > > > > >> > The owner determines when to delete the data and send the
> delete
> > > > > >> request to
> > > > > >> > kafka. Kafka doesn't know about the content of data but to
> > provide a
> > > > > >> mean
> > > > > >> > for deletion.
> > > > > >> >
> > > > > >> > 2 , each time compaction runs, it will start from first
> > segments (no
> > > > > >> > matter if it is compacted or not). The time estimation here is
> > only
> > > > > used
> > > > > >> > to determine whether we should run compaction on this log
> > partition.
> > > > > So
> > > > > >> we
> > > > > >> > only need to estimate uncompacted segments.
> > > > > >> >
> > > > > >> > On Thu, Aug 16, 2018 at 5:35 PM, Dong Lin <
> lindong28@gmail.com>
> > > > > wrote:
> > > > > >> >
> > > > > >> > > Hey Xiongqi,
> > > > > >> > >
> > > > > >> > > Thanks for the update. I have two questions for the latest
> > KIP.
> > > > > >> > >
> > > > > >> > > 1) The motivation section says that one use case is to
> delete
> > PII
> > > > > >> > (Personal
> > > > > >> > > Identifiable information) data within 7 days while keeping
> > non-PII
> > > > > >> > > indefinitely in compacted format. I suppose the use-case
> > depends
> > > > on
> > > > > >> the
> > > > > >> > > application to determine when to delete those PII data.
> Could
> > you
> > > > > >> explain
> > > > > >> > > how the application can reliably determine the set of keys that
> > should
> > > > > be
> > > > > >> > > deleted? Is the application required to always read messages from the
> > topic
> > > > > >> after
> > > > > >> > > every restart and determine the keys to be deleted by
> looking
> > at
> > > > > >> message
> > > > > >> > > timestamp, or is application supposed to persist the key->
> > > > timestamp
> > > > > >> > > information in a separate persistent storage system?
> > > > > >> > >
> > > > > >> > > 2) It is mentioned in the KIP that "we only need to estimate
> > > > > earliest
> > > > > >> > > message timestamp for un-compacted log segments because the
> > > > deletion
> > > > > >> > > requests that belong to compacted segments have already been
> > > > > >> processed".
> > > > > >> > > Not sure if it is correct. If a segment is compacted before
> > user
> > > > > sends
> > > > > >> > > message to delete a key in this segment, it seems that we
> > still
> > > > need
> > > > > >> to
> > > > > >> > > ensure that the segment will be compacted again within the
> > given
> > > > > time
> > > > > >> > after
> > > > > >> > > the deletion is requested, right?
> > > > > >> > >
> > > > > >> > > Thanks,
> > > > > >> > > Dong
> > > > > >> > >
> > > > > >> > > On Thu, Aug 16, 2018 at 10:27 AM, xiongqi wu <
> > xiongqiwu@gmail.com
> > > > >
> > > > > >> > wrote:
> > > > > >> > >
> > > > > >> > > > Hi Xiaohe,
> > > > > >> > > >
> > > > > >> > > > Quick note:
> > > > > >> > > > 1) Use minimum of segment.ms and max.compaction.lag.ms
> > > > > >> > > >
> > > > > >> > > > 2) I am not sure if I get your second question. first, we
> > have
> > > > > >> jitter
> > > > > >> > > when
> > > > > >> > > > we roll the active segment. second, on each compaction, we
> > > > compact
> > > > > >> up to what
> > > > > >> > > > the offset map allows. Those will not lead to a perfect
> > > > > compaction
> > > > > >> > > storm
> > > > > >> > > > over time. In addition, I expect we are setting
> > > > > >> max.compaction.lag.ms
> > > > > >> > on
> > > > > >> > > > the order of days.
> > > > > >> > > >
> > > > > >> > > > 3) I don't have access to the confluent community slack
> for
> > > > now. I
> > > > > >> am
> > > > > >> > > > reachable via Google Hangouts.
> > > > > >> > > > To avoid the double effort, here is my plan:
> > > > > >> > > > a) Collect more feedback and feature requirements on the
> KIP.
> > > > > >> > > > b) Wait until this KIP is approved.
> > > > > >> > > > c) I will address any additional requirements in the
> > > > > implementation.
> > > > > >> > (My
> > > > > >> > > > current implementation only complies with what is described
> > in
> > > > the
> > > > > >> KIP
> > > > > >> > > now)
> > > > > >> > > > d) I can share the code with you and the community to see if you
> > want
> > > > to
> > > > > >> add
> > > > > >> > > > anything.
> > > > > >> > > > e) submission through a committer
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > > On Wed, Aug 15, 2018 at 11:42 PM, XIAOHE DONG <
> > > > > >> dannyrivclo@gmail.com>
> > > > > >> > > > wrote:
> > > > > >> > > >
> > > > > >> > > > > Hi Xiongqi
> > > > > >> > > > >
> > > > > >> > > > > Thanks for thinking about implementing this as well. :)
> > > > > >> > > > >
> > > > > >> > > > > I was thinking about using `segment.ms` to trigger the
> > > > segment
> > > > > >> roll.
> > > > > >> > > > > Also, its value can be the largest time bias for the
> > record
> > > > > >> deletion.
> > > > > >> > > For
> > > > > >> > > > > example, if the `segment.ms` is 1 day and `
> > max.compaction.ms`
> > > > > is
> > > > > >> 30
> > > > > >> > > > days,
> > > > > >> > > > > the compaction may happen around 31 days.
> > > > > >> > > > >
> > > > > >> > > > > For my curiosity, is there a way we can do some
> > performance
> > > > test
> > > > > >> for
> > > > > >> > > this
> > > > > >> > > > > and any tools you can recommend. As you know,
> previously,
> > it
> > > > is
> > > > > >> > cleaned
> > > > > >> > > > up
> > > > > >> > > > > by respecting dirty ratio, but now it may happen anytime
> > if
> > > > max
> > > > > >> lag
> > > > > >> > has
> > > > > >> > > > > passed for each message. I wonder what would happen if
> > clients
> > > > > >> send
> > > > > >> > > huge
> > > > > >> > > > > amount of tombstone records at the same time.
> > > > > >> > > > >
> > > > > >> > > > > I am looking forward to have a quick chat with you to
> > avoid
> > > > > double
> > > > > >> > > effort
> > > > > >> > > > > on this. I am in confluent community slack during the
> work
> > > > time.
> > > > > >> My
> > > > > >> > > name
> > > > > >> > > > is
> > > > > >> > > > > Xiaohe Dong. :)
> > > > > >> > > > >
> > > > > >> > > > > Rgds
> > > > > >> > > > > Xiaohe Dong
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > > > On 2018/08/16 01:22:22, xiongqi wu <xiongqiwu@gmail.com
> >
> > > > wrote:
> > > > > >> > > > > > Brett,
> > > > > >> > > > > >
> > > > > >> > > > > > Thank you for your comments.
> > > > > >> > > > > > Since we already have an immediate
> compaction
> > > > > >> setting by
> > > > > >> > > > > setting
> > > > > >> > > > > > min dirty ratio to 0, I decided to use "0" as the
> disabled
> > > > > state.
> > > > > >> > > > > > I am ok to go with -1(disable), 0 (immediate) options.
> > > > > >> > > > > >
> > > > > >> > > > > > For the implementation, there are a few differences
> > between
> > > > > mine
> > > > > >> > and
> > > > > >> > > > > > "Xiaohe Dong"'s :
> > > > > >> > > > > > 1) I used the estimated creation time of a log segment
> > > > instead
> > > > > >> of
> > > > > >> > > > largest
> > > > > >> > > > > > timestamp of a log to determine the compaction
> > eligibility,
> > > > > >> > because a
> > > > > >> > > > log
> > > > > >> > > > > > segment might stay as an active segment up to "max
> > > > compaction
> > > > > >> lag".
> > > > > >> > > > (see
> > > > > >> > > > > > the KIP for detail).
> > > > > >> > > > > > 2) I measure how much bytes that we must clean to
> > follow the
> > > > > >> "max
> > > > > >> > > > > > compaction lag" rule, and use that to determine the
> > order of
> > > > > >> > > > compaction.
> > > > > >> > > > > > 3) force active segment to roll to follow the "max
> > > > compaction
> > > > > >> lag"
> > > > > >> > > > > >
> > > > > >> > > > > > I can share my code so we can coordinate.
> > > > > >> > > > > >
> > > > > >> > > > > > I haven't think about a new API to force a compaction.
> > what
> > > > is
> > > > > >> the
> > > > > >> > > use
> > > > > >> > > > > case
> > > > > >> > > > > > for this one?
> > > > > >> > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > > On Wed, Aug 15, 2018 at 5:33 PM, Brett Rann
> > > > > >> > > <brann@zendesk.com.invalid
> > > > > >> > > > >
> > > > > >> > > > > > wrote:
> > > > > >> > > > > >
> > > > > >> > > > > > > We've been looking into this too.
> > > > > >> > > > > > >
> > > > > >> > > > > > > Mailing list:
> > > > > >> > > > > > > https://lists.apache.org/thread.html/
> > > > > >> > > ed7f6a6589f94e8c2a705553f364ef
> > > > > >> > > > > > > 599cb6915e4c3ba9b561e610e4@%3Cdev.kafka.apache.org
> %3E
> > > > > >> > > > > > > jira wish:
> > > > https://issues.apache.org/jira/browse/KAFKA-7137
> > > > > >> > > > > > > confluent slack discussion:
> > > > > >> > > > > > >
> > https://confluentcommunity.slack.com/archives/C49R61XMM/
> > > > > >> > > > > p1530760121000039
> > > > > >> > > > > > >
> > > > > >> > > > > > > A person on my team has started on code so you might
> > want
> > > > to
> > > > > >> > > > > coordinate:
> > > > > >> > > > > > >
> > https://github.com/dongxiaohe/kafka/tree/dongxiaohe/log-
> > > > > >> > > > > > > cleaner-compaction-max-lifetime-2.0
> > > > > >> > > > > > >
> > > > > >> > > > > > > He's been working with Jason Gustafson and James
> Chen
> > > > around
> > > > > >> the
> > > > > >> > > > > changes.
> > > > > >> > > > > > > You can ping him on confluent slack as Xiaohe Dong.
> > > > > >> > > > > > >
> > > > > >> > > > > > > It's great to know others are thinking on it as
> well.
> > > > > >> > > > > > >
> > > > > >> > > > > > > You've added the requirement to force a segment roll
> > which
> > > > > we
> > > > > >> > > hadn't
> > > > > >> > > > > gotten
> > > > > >> > > > > > > to yet, which is great. I was content with it not
> > > > including
> > > > > >> the
> > > > > >> > > > active
> > > > > >> > > > > > > segment.
> > > > > >> > > > > > >
> > > > > >> > > > > > > > Adding topic level configuration "
> > max.compaction.lag.ms
> > > > ",
> > > > > >> and
> > > > > >> > > > > > > corresponding broker configuration "
> > > > > >> > log.cleaner.max.compaction.la
> > > > > >> > > > g.ms
> > > > > >> > > > > ",
> > > > > >> > > > > > > which is set to 0 (disabled) by default.
> > > > > >> > > > > > >
> > > > > >> > > > > > > Glancing at some other settings convention seems to
> > me to
> > > > be
> > > > > >> -1
> > > > > >> > for
> > > > > >> > > > > > > disabled (or infinite, which is more meaningful
> > here). 0
> > > > to
> > > > > me
> > > > > >> > > > implies
> > > > > >> > > > > > > instant, a little quicker than 1.
> > > > > >> > > > > > >
> > > > > >> > > > > > > We've been trying to think about a way to trigger
> > > > compaction
> > > > > >> as
> > > > > >> > > well
> > > > > >> > > > > > > through an API call, which would need to be flagged
> > > > > somewhere
> > > > > >> (ZK
> > > > > >> > > > > admin/
> > > > > >> > > > > > > space?) but we're struggling to think how that would
> > be
> > > > > >> > coordinated
> > > > > >> > > > > across
> > > > > >> > > > > > > brokers and partitions. Have you given any thought
> to
> > > > that?
> > > > > >> > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > > On Thu, Aug 16, 2018 at 8:44 AM xiongqi wu <
> > > > > >> xiongqiwu@gmail.com>
> > > > > >> > > > > wrote:
> > > > > >> > > > > > >
> > > > > >> > > > > > > > Eno, Dong,
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > I have updated the KIP. We decide not to address
> the
> > > > issue
> > > > > >> that
> > > > > >> > > we
> > > > > >> > > > > might
> > > > > >> > > > > > > > have for both compaction and time retention
> enabled
> > > > topics
> > > > > >> (see
> > > > > >> > > the
> > > > > >> > > > > > > > rejected alternative item 2). This KIP will only
> > ensure
> > > > > log
> > > > > >> can
> > > > > >> > > be
> > > > > >> > > > > > > > compacted after a specified time-interval.
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > As suggested by Dong, we will also enforce "
> > > > > >> > > max.compaction.lag.ms"
> > > > > >> > > > > is
> > > > > >> > > > > > > not
> > > > > >> > > > > > > > less than "min.compaction.lag.ms".
> > > > > >> > > > > > > >
> > > > > >> > > > > > > >
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-354
> > > > > >> > > > > Time-based
> > > > > >> > > > > > > log
> > > > > >> > > > > > > > compaction policy
> > > > > >> > > > > > > >
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > On Tue, Aug 14, 2018 at 5:01 PM, xiongqi wu <
> > > > > >> > xiongqiwu@gmail.com
> > > > > >> > > >
> > > > > >> > > > > wrote:
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > Per discussion with Dong, he made a very good
> > point
> > > > that
> > > > > >> if
> > > > > >> > > > > compaction
> > > > > >> > > > > > > > > and time based retention are both enabled on a
> > topic,
> > > > > the
> > > > > >> > > > > compaction
> > > > > >> > > > > > > > might
> > > > > >> > > > > > > > > prevent records from being deleted on time. The
> > reason
> > > > > is
> > > > > >> > when
> > > > > >> > > > > > > compacting
> > > > > >> > > > > > > > > multiple segments into one single segment, the
> > newly
> > > > > >> created
> > > > > >> > > > > segment
> > > > > >> > > > > > > will
> > > > > >> > > > > > > > > have same lastmodified timestamp as latest
> > original
> > > > > >> segment.
> > > > > >> > We
> > > > > >> > > > > lose
> > > > > >> > > > > > > the
> > > > > >> > > > > > > > > timestamp of all original segments except the
> last
> > > > one.
> > > > > >> As a
> > > > > >> > > > > result,
> > > > > >> > > > > > > > > records might not be deleted as it should be
> > through
> > > > > time
> > > > > >> > based
> > > > > >> > > > > > > > retention.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > With the current KIP proposal, if we want to
> > ensure
> > > > > timely
> > > > > >> > > > > deletion, we
> > > > > >> > > > > > > > > have the following configurations:
> > > > > >> > > > > > > > > 1) enable time based log compaction only :
> > deletion is
> > > > > >> done
> > > > > >> > > > through
> > > > > >> > > > > > > > > overriding the same key
> > > > > >> > > > > > > > > 2) enable time based log retention only:
> deletion
> > is
> > > > > done
> > > > > >> > > though
> > > > > >> > > > > > > > > time-based retention
> > > > > >> > > > > > > > > 3) enable both log compaction and time based
> > > > retention:
> > > > > >> > > Deletion
> > > > > >> > > > > is not
> > > > > >> > > > > > > > > guaranteed.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > Not sure if we have use case 3 and also want
> > deletion
> > > > to
> > > > > >> > happen
> > > > > >> > > > on
> > > > > >> > > > > > > time.
> > > > > >> > > > > > > > > There are several options to address deletion
> > issue
> > > > when
> > > > > >> > enable
> > > > > >> > > > > both
> > > > > >> > > > > > > > > compaction and retention:
> > > > > >> > > > > > > > > A) During log compaction, looking into record
> > > > timestamp
> > > > > to
> > > > > >> > > delete
> > > > > >> > > > > > > expired
> > > > > >> > > > > > > > > records. This can be done in compaction logic
> > itself
> > > > or
> > > > > >> use
> > > > > >> > > > > > > > > AdminClient.deleteRecords() . But this assumes
> we
> > have
> > > > > >> record
> > > > > >> > > > > > > timestamp.
> > > > > >> > > > > > > > > B) retain the lastModifed time of original
> > segments
> > > > > during
> > > > > >> > log
> > > > > >> > > > > > > > compaction.
> > > > > >> > > > > > > > > This requires extra meta data to record the
> > > > information
> > > > > or
> > > > > >> > not
> > > > > >> > > > > grouping
> > > > > >> > > > > > > > > multiple segments into one during compaction.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > If we have use case 3 in general, I would prefer
> > > > > solution
> > > > > >> A
> > > > > >> > and
> > > > > >> > > > > rely on
> > > > > >> > > > > > > > > record timestamp.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > Two questions:
> > > > > >> > > > > > > > > Do we have use case 3? Is it nice to have or
> must
> > > > have?
> > > > > >> > > > > > > > > If we have use case 3 and want to go with
> > solution A,
> > > > > >> should
> > > > > >> > we
> > > > > >> > > > > > > introduce
> > > > > >> > > > > > > > > a new configuration to enforce deletion by
> > timestamp?
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > On Tue, Aug 14, 2018 at 1:52 PM, xiongqi wu <
> > > > > >> > > xiongqiwu@gmail.com
> > > > > >> > > > >
> > > > > >> > > > > > > wrote:
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > >> Dong,
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >> Thanks for the comment.
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >> There are two retention policy: log compaction
> > and
> > > > time
> > > > > >> > based
> > > > > >> > > > > > > retention.
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >> Log compaction:
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >> we have use cases to keep infinite retention
> of a
> > > > topic
> > > > > >> > (only
> > > > > >> > > > > > > > >> compaction). GDPR cares about deletion of PII
> > > > (personal
> > > > > >> > > > > identifiable
> > > > > >> > > > > > > > >> information) data.
> > > > > >> > > > > > > > >> Since Kafka doesn't know what records contain
> > PII, it
> > > > > >> relies
> > > > > >> > > on
> > > > > >> > > > > upper
> > > > > >> > > > > > > > >> layer to delete those records.
> > > > > >> > > > > > > > >> For those infinite retention uses uses, kafka
> > needs
> > > > to
> > > > > >> > > provide a
> > > > > >> > > > > way
> > > > > >> > > > > > > to
> > > > > >> > > > > > > > >> enforce compaction on time. This is what we try
> > to
> > > > > >> address
> > > > > >> > in
> > > > > >> > > > this
> > > > > >> > > > > > > KIP.
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >> Time based retention,
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >> There are also use cases that users of Kafka
> > might
> > > > want
> > > > > >> to
> > > > > >> > > > expire
> > > > > >> > > > > all
> > > > > >> > > > > > > > >> their data.
> > > > > >> > > > > > > > >> In those cases, they can use time based
> > retention of
> > > > > >> their
> > > > > >> > > > topics.
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >> Regarding your first question, if a user wants
> to
> > > > > delete
> > > > > >> a
> > > > > >> > key
> > > > > >> > > > in
> > > > > >> > > > > the
> > > > > >> > > > > > > > >> log compaction topic, the user has to send a
> > deletion
> > > > > >> using
> > > > > >> > > the
> > > > > >> > > > > same
> > > > > >> > > > > > > > key.
> > > > > >> > > > > > > > >> Kafka only makes sure the deletion will happen
> > under
> > > > a
> > > > > >> > certain
> > > > > >> > > > > time
> > > > > >> > > > > > > > >> periods (like 2 days/7 days).
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >> Regarding your second question. In most cases,
> we
> > > > might
> > > > > >> want
> > > > > >> > > to
> > > > > >> > > > > delete
> > > > > >> > > > > > > > >> all duplicated keys at the same time.
> > > > > >> > > > > > > > >> Compaction might be more efficient since we
> need
> > to
> > > > > scan
> > > > > >> the
> > > > > >> > > log
> > > > > >> > > > > and
> > > > > >> > > > > > > > find
> > > > > >> > > > > > > > >> all duplicates. However, the expected use case
> > is to
> > > > > set
> > > > > >> the
> > > > > >> > > > time
> > > > > >> > > > > > > based
> > > > > >> > > > > > > > >> compaction interval on the order of days, and
> be
> > > > larger
> > > > > >> than
> > > > > >> > > > 'min
> > > > > >> > > > > > > > >> compaction lag". We don't want log compaction
> to
> > > > happen
> > > > > >> > > > frequently
> > > > > >> > > > > > > since
> > > > > >> > > > > > > > >> it is expensive. The purpose is to help low
> > > > production
> > > > > >> rate
> > > > > >> > > > topic
> > > > > >> > > > > to
> > > > > >> > > > > > > get
> > > > > >> > > > > > > > >> compacted on time. For the topic with "normal"
> > > > incoming
> > > > > >> > > message
> > > > > >> > > > > > > > >> rate, the "min dirty ratio" might have
> triggered
> > the
> > > > > >> > > compaction
> > > > > >> > > > > before
> > > > > >> > > > > > > > this
> > > > > >> > > > > > > > >> time based compaction policy takes effect.
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >> Eno,
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >> For your question, like I mentioned we have
> long
> > time
> > > > > >> > > retention
> > > > > >> > > > > use
> > > > > >> > > > > > > case
> > > > > >> > > > > > > > >> for log compacted topic, but we want to provide
> > > > ability
> > > > > >> to
> > > > > >> > > > delete
> > > > > >> > > > > > > > certain
> > > > > >> > > > > > > > >> PII records on time.
> > > > > >> > > > > > > > >> Kafka itself doesn't know whether a record
> > contains
> > > > > >> > sensitive
> > > > > >> > > > > > > > information
> > > > > >> > > > > > > > >> and relies on the user for deletion.
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >> On Mon, Aug 13, 2018 at 6:58 PM, Dong Lin <
> > > > > >> > > lindong28@gmail.com>
> > > > > >> > > > > > > wrote:
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >>> Hey Xiongqi,
> > > > > >> > > > > > > > >>>
> > > > > >> > > > > > > > >>> Thanks for the KIP. I have two questions
> > regarding
> > > > the
> > > > > >> > > use-case
> > > > > >> > > > > for
> > > > > >> > > > > > > > >>> meeting
> > > > > >> > > > > > > > >>> GDPR requirement.
> > > > > >> > > > > > > > >>>
> > > > > >> > > > > > > > >>> 1) If I recall correctly, one of the GDPR
> > > > requirement
> > > > > is
> > > > > >> > that
> > > > > >> > > > we
> > > > > >> > > > > can
> > > > > >> > > > > > > > not
> > > > > >> > > > > > > > >>> keep messages longer than e.g. 30 days in
> > storage
> > > > > (e.g.
> > > > > >> > > Kafka).
> > > > > >> > > > > Say
> > > > > >> > > > > > > > there
> > > > > >> > > > > > > > >>> exists a partition p0 which contains message1
> > with
> > > > > key1
> > > > > >> and
> > > > > >> > > > > message2
> > > > > >> > > > > > > > with
> > > > > >> > > > > > > > >>> key2. And then user keeps producing messages
> > with
> > > > > >> key=key2
> > > > > >> > to
> > > > > >> > > > > this
> > > > > >> > > > > > > > >>> partition. Since message1 with key1 is never
> > > > > overridden,
> > > > > >> > > sooner
> > > > > >> > > > > or
> > > > > >> > > > > > > > later
> > > > > >> > > > > > > > >>> we
> > > > > >> > > > > > > > >>> will want to delete message1 and keep the
> latest
> > > > > message
> > > > > >> > with
> > > > > >> > > > > > > key=key2.
> > > > > >> > > > > > > > >>> But
> > > > > >> > > > > > > > >>> currently it looks like log compact logic in
> > Kafka
> > > > > will
> > > > > >> > > always
> > > > > >> > > > > put
> > > > > >> > > > > > > > these
> > > > > >> > > > > > > > >>> messages in the same segment. Will this be an
> > issue?
> > > > > >> > > > > > > > >>>
> > > > > >> > > > > > > > >>> 2) The current KIP intends to provide the
> > capability
> > > > > to
> > > > > >> > > delete
> > > > > >> > > > a
> > > > > >> > > > > > > given
> > > > > >> > > > > > > > >>> message in log compacted topic. Does such
> > use-case
> > > > > also
> > > > > >> > > require
> > > > > >> > > > > Kafka
> > > > > >> > > > > > > > to
> > > > > >> > > > > > > > >>> keep the messages produced before the given
> > message?
> > > > > If
> > > > > >> > yes,
> > > > > >> > > > > then we
> > > > > >> > > > > > > > can
> > > > > >> > > > > > > > >>> probably just use AdminClient.deleteRecords()
> or
> > > > > >> time-based
> > > > > >> > > log
> > > > > >> > > > > > > > retention
> > > > > >> > > > > > > > >>> to meet the use-case requirement. If no, do
> you
> > know
> > > > > >> what
> > > > > >> > is
> > > > > >> > > > the
> > > > > >> > > > > > > GDPR's
> > > > > >> > > > > > > > >>> requirement on time-to-deletion after user
> > > > explicitly
> > > > > >> > > requests
> > > > > >> > > > > the
> > > > > >> > > > > > > > >>> deletion
> > > > > >> > > > > > > > >>> (e.g. 1 hour, 1 day, 7 day)?
> > > > > >> > > > > > > > >>>
> > > > > >> > > > > > > > >>> Thanks,
> > > > > >> > > > > > > > >>> Dong
> > > > > >> > > > > > > > >>>
> > > > > >> > > > > > > > >>>
> > > > > >> > > > > > > > >>> On Mon, Aug 13, 2018 at 3:44 PM, xiongqi wu <
> > > > > >> > > > xiongqiwu@gmail.com
> > > > > >> > > > > >
> > > > > >> > > > > > > > wrote:
> > > > > >> > > > > > > > >>>
> > > > > >> > > > > > > > >>> > Hi Eno,
> > > > > >> > > > > > > > >>> >
> > > > > >> > > > > > > > >>> > The GDPR request we are getting here at
> > linkedin
> > > > is
> > > > > >> if we
> > > > > >> > > > get a
> > > > > >> > > > > > > > >>> request to
> > > > > >> > > > > > > > >>> > delete a record through a null key on a log
> > > > > compacted
> > > > > >> > > topic,
> > > > > >> > > > > > > > >>> > we want to delete the record via compaction
> > in a
> > > > > given
> > > > > >> > time
> > > > > >> > > > > period
> > > > > >> > > > > > > > >>> like 2
> > > > > >> > > > > > > > >>> > days (whatever is required by the policy).
> > > > > >> > > > > > > > >>> >
> > > > > >> > > > > > > > >>> > There might be other issues (such as orphan
> > log
> > > > > >> segments
> > > > > >> > > > under
> > > > > >> > > > > > > > certain
> > > > > >> > > > > > > > >>> > conditions) that lead to GDPR problem but
> > they are
> > > > > >> more
> > > > > >> > > like
> > > > > >> > > > > > > > >>> something we
> > > > > >> > > > > > > > >>> > need to fix anyway regardless of GDPR.
> > > > > >> > > > > > > > >>> >
> > > > > >> > > > > > > > >>> >
> > > > > >> > > > > > > > >>> > -- Xiongqi (Wesley) Wu
> > > > > >> > > > > > > > >>> >
> > > > > >> > > > > > > > >>> > On Mon, Aug 13, 2018 at 2:56 PM, Eno
> Thereska
> > <
> > > > > >> > > > > > > > eno.thereska@gmail.com>
> > > > > >> > > > > > > > >>> > wrote:
> > > > > >> > > > > > > > >>> >
> > > > > >> > > > > > > > >>> > > Hello,
> > > > > >> > > > > > > > >>> > >
> > > > > >> > > > > > > > >>> > > Thanks for the KIP. I'd like to see a more
> > > > precise
> > > > > >> > > > > definition of
> > > > > >> > > > > > > > what
> > > > > >> > > > > > > > >>> > part
> > > > > >> > > > > > > > >>> > > of GDPR you are targeting as well as some
> > sort
> > > > of
> > > > > >> > > > > verification
> > > > > >> > > > > > > that
> > > > > >> > > > > > > > >>> this
> > > > > >> > > > > > > > >>> > > KIP actually addresses the problem. Right
> > now I
> > > > > find
> > > > > >> > > this a
> > > > > >> > > > > bit
> > > > > >> > > > > > > > >>> vague:
> > > > > >> > > > > > > > >>> > >
> > > > > >> > > > > > > > >>> > > "Ability to delete a log message through
> > > > > compaction
> > > > > >> in
> > > > > >> > a
> > > > > >> > > > > timely
> > > > > >> > > > > > > > >>> manner
> > > > > >> > > > > > > > >>> > has
> > > > > >> > > > > > > > >>> > > become an important requirement in some
> use
> > > > cases
> > > > > >> > (e.g.,
> > > > > >> > > > > GDPR)"
> > > > > >> > > > > > > > >>> > >
> > > > > >> > > > > > > > >>> > >
> > > > > >> > > > > > > > >>> > > Is there any guarantee that after this KIP
> > the
> > > > > GDPR
> > > > > >> > > problem
> > > > > >> > > > > is
> > > > > >> > > > > > > > >>> solved or
> > > > > >> > > > > > > > >>> > do
> > > > > >> > > > > > > > >>> > > we need to do something else as well,
> e.g.,
> > more
> > > > > >> KIPs?
> > > > > >> > > > > > > > >>> > >
> > > > > >> > > > > > > > >>> > >
> > > > > >> > > > > > > > >>> > > Thanks
> > > > > >> > > > > > > > >>> > >
> > > > > >> > > > > > > > >>> > > Eno
> > > > > >> > > > > > > > >>> > >
> > > > > >> > > > > > > > >>> > >
> > > > > >> > > > > > > > >>> > >
> > > > > >> > > > > > > > >>> > > On Thu, Aug 9, 2018 at 4:18 PM, xiongqi
> wu <
> > > > > >> > > > > xiongqiwu@gmail.com>
> > > > > >> > > > > > > > >>> wrote:
> > > > > >> > > > > > > > >>> > >
> > > > > >> > > > > > > > >>> > > > Hi Kafka,
> > > > > >> > > > > > > > >>> > > >
> > > > > >> > > > > > > > >>> > > > This KIP tries to address GDPR concern
> to
> > > > > fulfill
> > > > > >> > > > deletion
> > > > > >> > > > > > > > request
> > > > > >> > > > > > > > >>> on
> > > > > >> > > > > > > > >>> > > time
> > > > > >> > > > > > > > >>> > > > through time-based log compaction on a
> > > > > compaction
> > > > > >> > > enabled
> > > > > >> > > > > > > topic:
> > > > > >> > > > > > > > >>> > > >
> > > > > >> > > > > > > > >>> > > >
> > > > > >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > > > >> > > > > > > > >>> > > > 354%3A+Time-based+log+compaction+policy
> > > > > >> > > > > > > > >>> > > >
> > > > > >> > > > > > > > >>> > > > Any feedback will be appreciated.
> > > > > >> > > > > > > > >>> > > >
> > > > > >> > > > > > > > >>> > > >
> > > > > >> > > > > > > > >>> > > > Xiongqi (Wesley) Wu
> > > > > >> > > > > > > > >>> > > >
> > > > > >> > > > > > > > >>> > >
> > > > > >> > > > > > > > >>> >
> > > > > >> > > > > > > > >>>
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >> --
> > > > > >> > > > > > > > >> Xiongqi (Wesley) Wu
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > --
> > > > > >> > > > > > > > > Xiongqi (Wesley) Wu
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > >
> > > > > >> > > > > > > >
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > --
> > > > > >> > > > > > > > Xiongqi (Wesley) Wu
> > > > > >> > > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > > --
> > > > > >> > > > > > >
> > > > > >> > > > > > > Brett Rann
> > > > > >> > > > > > >
> > > > > >> > > > > > > Senior DevOps Engineer
> > > > > >> > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > > Zendesk International Ltd
> > > > > >> > > > > > >
> > > > > >> > > > > > > 395 Collins Street, Melbourne VIC 3000 Australia
> > > > > >> > > > > > >
> > > > > >> > > > > > > Mobile: +61 (0) 418 826 017
> > > > > >> > > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > > --
> > > > > >> > > > > > Xiongqi (Wesley) Wu
> > > > > >> > > > > >
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > > --
> > > > > >> > > > Xiongqi (Wesley) Wu
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> > --
> > > > > >> > Xiongqi (Wesley) Wu
> > > > > >> >
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Xiongqi (Wesley) Wu
> > > > >
> > > >
> > > >
> > > >
> >
>
>
> --
> -Regards,
> Mayuresh R. Gharat
> (862) 250-7125
>

Re: [DISCUSS] KIP-354 Time-based log compaction policy

Posted by Mayuresh Gharat <gh...@gmail.com>.
Hi Wesley,

Thanks for the KIP and sorry for being late to the party.
I wanted to understand the scenario you mentioned in Proposed Changes:

-
>
> Estimate the earliest message timestamp of an un-compacted log segment. We
> only need to estimate earliest message timestamp for un-compacted log
> segments to ensure timely compaction because the deletion requests that
> belong to compacted segments have already been processed.
>
>    1.
>
>    for the first (earliest) log segment:  The estimated earliest
>    timestamp is set to the timestamp of the first message if timestamp is
>    present in the message. Otherwise, the estimated earliest timestamp is set
>    to "segment.largestTimestamp - maxSegmentMs”
>     (segment.largestTimestamp is lastModified time of the log segment or max
>    timestamp we see for the log segment.). In the latter case, the actual
>    timestamp of the first message might be later than the estimation, but it
>    is safe to pick up the log for compaction earlier.
>
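The estimation rule quoted above can be written out directly; the function and parameter names below are illustrative, not the KIP's actual code:

```python
def estimate_earliest_timestamp(first_message_ts, largest_ts, max_segment_ms):
    """Estimate the earliest timestamp of the first (earliest) log segment.

    Per the quoted rule: use the first message's timestamp when the
    message format carries one; otherwise fall back to
    largestTimestamp - maxSegmentMs, a conservative lower bound, so the
    segment is, if anything, picked up for compaction early rather than
    late.
    """
    if first_message_ts is not None:
        return first_message_ts
    return largest_ts - max_segment_ms
```

The fallback can only under-estimate the true first timestamp, which is the "safe to pick up the log for compaction earlier" direction.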
> When we say "actual timestamp of the first message might be later than the
estimation, but it is safe to pick up the log for compaction earlier.",
doesn't that violate the assumption that we will consider a segment for
compaction only if the time of creation of the segment has crossed the "now -
maxCompactionLagMs" ?

Thanks,

Mayuresh

On Mon, Sep 3, 2018 at 7:28 PM Brett Rann <br...@zendesk.com.invalid> wrote:

> Might also be worth moving to a vote thread? Discussion seems to have gone
> as far as it can.
>
> > On 4 Sep 2018, at 12:08, xiongqi wu <xi...@gmail.com> wrote:
> >
> > Brett,
> >
> > Yes, I will post PR tomorrow.
> >
> > Xiongqi (Wesley) Wu
> >
> >
> > On Sun, Sep 2, 2018 at 6:28 PM Brett Rann <br...@zendesk.com.invalid>
> wrote:
> >
> > > +1 (non-binding) from me on the interface. I'd like to see someone
> familiar
> > > with
> > > the code comment on the approach, and note there's a couple of
> different
> > > approaches: what's documented in the KIP, and what Xiaohe Dong was
> working
> > > on
> > > here:
> > >
> > >
> https://github.com/dongxiaohe/kafka/tree/dongxiaohe/log-cleaner-compaction-max-lifetime-2.0
> > >
> > > If you have code working already Xiongqi Wu could you share a PR? I'd
> be
> > > happy
> > > to start testing.
> > >
> > > On Tue, Aug 28, 2018 at 5:57 AM xiongqi wu <xi...@gmail.com>
> wrote:
> > >
> > > > Hi All,
> > > >
> > > > Do you have any additional comments on this KIP?
> > > >
> > > >
> > > > On Thu, Aug 16, 2018 at 9:17 PM, xiongqi wu <xi...@gmail.com>
> wrote:
> > > >
> > > > > on 2)
> > > > > The offsetmap is built starting from dirty segment.
> > > > > The compaction starts from the beginning of the log partition.
> That's
> > > how
> > > > > it ensures the deletion of tombstoned keys.
> > > > > I will double check tomorrow.
> > > > >
> > > > > Xiongqi (Wesley) Wu
> > > > >
> > > > >
> > > > > On Thu, Aug 16, 2018 at 6:46 PM Brett Rann
> <br...@zendesk.com.invalid>
> > > > > wrote:
> > > > >
> > > > >> To just clarify a bit on 1. whether there's an external storage/DB
> > > isn't
> > > > >> relevant here.
> > > > >> Compacted topics allow a tombstone record to be sent (a null value
> > > for a
> > > > >> key) which
> > > > >> currently will result in old values for that key being deleted if
> some
> > > > >> conditions are met.
> > > > >> There are existing controls to make sure the old values will stay
> > > around
> > > > >> for a minimum
> > > > >> time at least, but no dedicated control to ensure the tombstone
> will
> > > > >> delete
> > > > >> within a
> > > > >> maximum time.
> > > > >>
> > > > >> One popular reason that maximum time for deletion is desirable
> right
> > > now
> > > > >> is
> > > > >> GDPR with
> > > > >> PII. But we're not proposing any GDPR awareness in kafka, just
> being
> > > > able
> > > > >> to guarantee
> > > > >> a max time where a tombstoned key will be removed from the
> compacted
> > > > >> topic.
> > > > >>
> > > > >> on 2)
> > > > >> huh, i thought it kept track of the first dirty segment and didn't
> > > > >> recompact older "clean" ones.
> > > > >> But I didn't look at code or test for that.
> > > > >>
> > > > >> On Fri, Aug 17, 2018 at 10:57 AM xiongqi wu <xi...@gmail.com>
> > > > wrote:
> > > > >>
> > > > >> > 1, Owner of data (in this sense, kafka is the not the owner of
> data)
> > > > >> > should keep track of lifecycle of the data in some external
> > > > storage/DB.
> > > > >> > The owner determines when to delete the data and send the delete
> > > > >> request to
> > > > >> > kafka. Kafka doesn't know about the content of data but to
> provide a
> > > > >> mean
> > > > >> > for deletion.
> > > > >> >
> > > > >> > 2 , each time compaction runs, it will start from first
> segments (no
> > > > >> > matter if it is compacted or not). The time estimation here is
> only
> > > > used
> > > > >> > to determine whether we should run compaction on this log
> partition.
> > > > So
> > > > >> we
> > > > >> > only need to estimate uncompacted segments.
> > > > >> >
> > > > >> > On Thu, Aug 16, 2018 at 5:35 PM, Dong Lin <li...@gmail.com>
> > > > wrote:
> > > > >> >
> > > > >> > > Hey Xiongqi,
> > > > >> > >
> > > > >> > > Thanks for the update. I have two questions for the latest
> KIP.
> > > > >> > >
> > > > >> > > 1) The motivation section says that one use case is to delete
> PII
> > > > >> > (Personal
> > > > >> > > Identifiable information) data within 7 days while keeping
> non-PII
> > > > >> > > indefinitely in compacted format. I suppose the use-case
> depends
> > > on
> > > > >> the
> > > > >> > > application to determine when to delete those PII data. Could
> you
> > > > >> explain
> > > > >> > > how can application reliably determine the set of keys that
> should
> > > > be
> > > > > >> > > deleted? Is the application required to always read messages from the
> topic
> > > > >> after
> > > > >> > > every restart and determine the keys to be deleted by looking
> at
> > > > >> message
> > > > >> > > timestamp, or is application supposed to persist the key->
> > > timestamp
> > > > >> > > information in a separate persistent storage system?
> > > > >> > >
> > > > >> > > 2) It is mentioned in the KIP that "we only need to estimate
> > > > earliest
> > > > >> > > message timestamp for un-compacted log segments because the
> > > deletion
> > > > >> > > requests that belong to compacted segments have already been
> > > > >> processed".
> > > > >> > > Not sure if it is correct. If a segment is compacted before
> user
> > > > sends
> > > > >> > > message to delete a key in this segment, it seems that we
> still
> > > need
> > > > >> to
> > > > >> > > ensure that the segment will be compacted again within the
> given
> > > > time
> > > > >> > after
> > > > >> > > the deletion is requested, right?
> > > > >> > >
> > > > >> > > Thanks,
> > > > >> > > Dong
> > > > >> > >
> > > > >> > > On Thu, Aug 16, 2018 at 10:27 AM, xiongqi wu <
> xiongqiwu@gmail.com
> > > >
> > > > >> > wrote:
> > > > >> > >
> > > > >> > > > Hi Xiaohe,
> > > > >> > > >
> > > > >> > > > Quick note:
> > > > >> > > > 1) Use minimum of segment.ms and max.compaction.lag.ms
> > > > >> > > >
> > > > >> > > > 2) I am not sure if I get your second question. first, we
> have
> > > > >> jitter
> > > > >> > > when
> > > > >> > > > we roll the active segment. second, on each compaction, we
> > > compact
> > > > >> upto
> > > > >> > > > the offsetmap could allow. Those will not lead to perfect
> > > > compaction
> > > > >> > > storm
> > > > >> > > > overtime. In addition, I expect we are setting
> > > > >> max.compaction.lag.ms
> > > > >> > on
> > > > >> > > > the order of days.
> > > > >> > > >
> > > > >> > > > 3) I don't have access to the confluent community slack for
> > > now. I
> > > > >> am
> > > > > >> > > > reachable via Google Hangouts.
> > > > >> > > > To avoid the double effort, here is my plan:
> > > > > >> > > > a) Collect more feedback and feature requirements on the KIP.
> > > > > >> > > > b) Wait until this KIP is approved.
> > > > >> > > > c) I will address any additional requirements in the
> > > > implementation.
> > > > >> > (My
> > > > >> > > > current implementation only complies to whatever described
> in
> > > the
> > > > >> KIP
> > > > >> > > now)
> > > > >> > > > d) I can share the code with the you and community see you
> want
> > > to
> > > > >> add
> > > > >> > > > anything.
> > > > >> > > > e) submission through committee
> > > > >> > > >
> > > > >> > > >
> > > > >> > > > On Wed, Aug 15, 2018 at 11:42 PM, XIAOHE DONG <
> > > > >> dannyrivclo@gmail.com>
> > > > >> > > > wrote:
> > > > >> > > >
> > > > >> > > > > Hi Xiongqi
> > > > >> > > > >
> > > > >> > > > > Thanks for thinking about implementing this as well. :)
> > > > >> > > > >
> > > > >> > > > > I was thinking about using `segment.ms` to trigger the
> > > segment
> > > > >> roll.
> > > > >> > > > > Also, its value can be the largest time bias for the
> record
> > > > >> deletion.
> > > > >> > > For
> > > > >> > > > > example, if the `segment.ms` is 1 day and `
> max.compaction.ms`
> > > > is
> > > > >> 30
> > > > >> > > > days,
> > > > >> > > > > the compaction may happen around 31 days.
> > > > >> > > > >
> > > > >> > > > > For my curiosity, is there a way we can do some
> performance
> > > test
> > > > >> for
> > > > >> > > this
> > > > >> > > > > and any tools you can recommend. As you know, previously,
> it
> > > is
> > > > >> > cleaned
> > > > >> > > > up
> > > > >> > > > > by respecting dirty ratio, but now it may happen anytime
> if
> > > max
> > > > >> lag
> > > > >> > has
> > > > >> > > > > passed for each message. I wonder what would happen if
> clients
> > > > >> send
> > > > >> > > huge
> > > > >> > > > > amount of tombstone records at the same time.
> > > > >> > > > >
> > > > >> > > > > I am looking forward to have a quick chat with you to
> avoid
> > > > double
> > > > >> > > effort
> > > > >> > > > > on this. I am in confluent community slack during the work
> > > time.
> > > > >> My
> > > > >> > > name
> > > > >> > > > is
> > > > >> > > > > Xiaohe Dong. :)
> > > > >> > > > >
> > > > >> > > > > Rgds
> > > > >> > > > > Xiaohe Dong
> > > > >> > > > >
> > > > >> > > > >
> > > > >> > > > >
> > > > >> > > > > On 2018/08/16 01:22:22, xiongqi wu <xi...@gmail.com>
> > > wrote:
> > > > >> > > > > > Brett,
> > > > >> > > > > >
> > > > >> > > > > > Thank you for your comments.
> > > > >> > > > > > I was thinking since we already has immediate compaction
> > > > >> setting by
> > > > >> > > > > setting
> > > > >> > > > > > min dirty ratio to 0, so I decide to use "0" as disabled
> > > > state.
> > > > >> > > > > > I am ok to go with -1(disable), 0 (immediate) options.
> > > > >> > > > > >
> > > > >> > > > > > For the implementation, there are a few differences
> between
> > > > mine
> > > > >> > and
> > > > >> > > > > > "Xiaohe Dong"'s :
> > > > >> > > > > > 1) I used the estimated creation time of a log segment
> > > instead
> > > > >> of
> > > > >> > > > largest
> > > > >> > > > > > timestamp of a log to determine the compaction
> eligibility,
> > > > >> > because a
> > > > >> > > > log
> > > > >> > > > > > segment might stay as an active segment up to "max
> > > compaction
> > > > >> lag".
> > > > >> > > > (see
> > > > >> > > > > > the KIP for detail).
> > > > > >> > > > > > 2) I measure how many bytes we must clean to
> follow the
> > > > >> "max
> > > > >> > > > > > compaction lag" rule, and use that to determine the
> order of
> > > > >> > > > compaction.
> > > > >> > > > > > 3) force active segment to roll to follow the "max
> > > compaction
> > > > >> lag"
> > > > >> > > > > >
> > > > >> > > > > > I can share my code so we can coordinate.
> > > > >> > > > > >
> > > > >> > > > > > I haven't think about a new API to force a compaction.
> what
> > > is
> > > > >> the
> > > > >> > > use
> > > > >> > > > > case
> > > > >> > > > > > for this one?
> > > > >> > > > > >
> > > > >> > > > > >
> > > > >> > > > > > On Wed, Aug 15, 2018 at 5:33 PM, Brett Rann
> > > > >> > > <brann@zendesk.com.invalid
> > > > >> > > > >
> > > > >> > > > > > wrote:
> > > > >> > > > > >
> > > > >> > > > > > > We've been looking into this too.
> > > > >> > > > > > >
> > > > >> > > > > > > Mailing list:
> > > > >> > > > > > > https://lists.apache.org/thread.html/
> > > > >> > > ed7f6a6589f94e8c2a705553f364ef
> > > > >> > > > > > > 599cb6915e4c3ba9b561e610e4@%3Cdev.kafka.apache.org%3E
> > > > >> > > > > > > jira wish:
> > > https://issues.apache.org/jira/browse/KAFKA-7137
> > > > >> > > > > > > confluent slack discussion:
> > > > >> > > > > > >
> https://confluentcommunity.slack.com/archives/C49R61XMM/
> > > > >> > > > > p1530760121000039
> > > > >> > > > > > >
> > > > >> > > > > > > A person on my team has started on code so you might
> want
> > > to
> > > > >> > > > > coordinate:
> > > > >> > > > > > >
> https://github.com/dongxiaohe/kafka/tree/dongxiaohe/log-
> > > > >> > > > > > > cleaner-compaction-max-lifetime-2.0
> > > > >> > > > > > >
> > > > >> > > > > > > He's been working with Jason Gustafson and James Chen
> > > around
> > > > >> the
> > > > >> > > > > changes.
> > > > >> > > > > > > You can ping him on confluent slack as Xiaohe Dong.
> > > > >> > > > > > >
> > > > >> > > > > > > It's great to know others are thinking on it as well.
> > > > >> > > > > > >
> > > > >> > > > > > > You've added the requirement to force a segment roll
> which
> > > > we
> > > > >> > > hadn't
> > > > >> > > > > gotten
> > > > >> > > > > > > to yet, which is great. I was content with it not
> > > including
> > > > >> the
> > > > >> > > > active
> > > > >> > > > > > > segment.
> > > > >> > > > > > >
> > > > >> > > > > > > > Adding topic level configuration "
> max.compaction.lag.ms
> > > ",
> > > > >> and
> > > > >> > > > > > > corresponding broker configuration "
> > > > > >> > log.cleaner.max.compaction.lag.ms
> > > > >> > > > > ",
> > > > >> > > > > > > which is set to 0 (disabled) by default.
> > > > >> > > > > > >
> > > > >> > > > > > > Glancing at some other settings convention seems to
> me to
> > > be
> > > > >> -1
> > > > >> > for
> > > > >> > > > > > > disabled (or infinite, which is more meaningful
> here). 0
> > > to
> > > > me
> > > > >> > > > implies
> > > > >> > > > > > > instant, a little quicker than 1.
> > > > >> > > > > > >
> > > > >> > > > > > > We've been trying to think about a way to trigger
> > > compaction
> > > > >> as
> > > > >> > > well
> > > > >> > > > > > > through an API call, which would need to be flagged
> > > > somewhere
> > > > >> (ZK
> > > > >> > > > > admin/
> > > > >> > > > > > > space?) but we're struggling to think how that would
> be
> > > > >> > coordinated
> > > > >> > > > > across
> > > > >> > > > > > > brokers and partitions. Have you given any thought to
> > > that?
> > > > >> > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > > > On Thu, Aug 16, 2018 at 8:44 AM xiongqi wu <
> > > > >> xiongqiwu@gmail.com>
> > > > >> > > > > wrote:
> > > > >> > > > > > >
> > > > >> > > > > > > > Eno, Dong,
> > > > >> > > > > > > >
> > > > >> > > > > > > > I have updated the KIP. We decide not to address the
> > > issue
> > > > >> that
> > > > >> > > we
> > > > >> > > > > might
> > > > >> > > > > > > > have for both compaction and time retention enabled
> > > topics
> > > > >> (see
> > > > >> > > the
> > > > >> > > > > > > > rejected alternative item 2). This KIP will only
> ensure
> > > > log
> > > > >> can
> > > > >> > > be
> > > > >> > > > > > > > compacted after a specified time-interval.
> > > > >> > > > > > > >
> > > > >> > > > > > > > As suggested by Dong, we will also enforce "
> > > > >> > > max.compaction.lag.ms"
> > > > >> > > > > is
> > > > >> > > > > > > not
> > > > >> > > > > > > > less than "min.compaction.lag.ms".
> > > > >> > > > > > > >
> > > > >> > > > > > > >
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-354
> > > > >> > > > > Time-based
> > > > >> > > > > > > log
> > > > >> > > > > > > > compaction policy
> > > > >> > > > > > > > <
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-354
> > > > >> > > > > Time-based
> > > > >> > > > > > > log compaction policy>
> > > > >> > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > > > On Tue, Aug 14, 2018 at 5:01 PM, xiongqi wu <
> > > > >> > xiongqiwu@gmail.com
> > > > >> > > >
> > > > >> > > > > wrote:
> > > > >> > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Per discussion with Dong, he made a very good
> point
> > > that
> > > > >> if
> > > > >> > > > > compaction
> > > > >> > > > > > > > > and time based retention are both enabled on a
> topic,
> > > > the
> > > > >> > > > > compaction
> > > > >> > > > > > > > might
> > > > >> > > > > > > > > prevent records from being deleted on time. The
> reason
> > > > is
> > > > >> > when
> > > > >> > > > > > > compacting
> > > > >> > > > > > > > > multiple segments into one single segment, the
> newly
> > > > >> created
> > > > >> > > > > segment
> > > > >> > > > > > > will
> > > > >> > > > > > > > > have same lastmodified timestamp as latest
> original
> > > > >> segment.
> > > > >> > We
> > > > >> > > > > lose
> > > > >> > > > > > > the
> > > > >> > > > > > > > > timestamp of all original segments except the last
> > > one.
> > > > >> As a
> > > > >> > > > > result,
> > > > > >> > > > > > > > > records might not be deleted as they should be
> through
> > > > time
> > > > >> > based
> > > > >> > > > > > > > retention.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > With the current KIP proposal, if we want to
> ensure
> > > > timely
> > > > >> > > > > deletion, we
> > > > >> > > > > > > > > have the following configurations:
> > > > >> > > > > > > > > 1) enable time based log compaction only :
> deletion is
> > > > >> done
> > > > >> > > > though
> > > > >> > > > > > > > > overriding the same key
> > > > >> > > > > > > > > 2) enable time based log retention only: deletion
> is
> > > > done
> > > > >> > > though
> > > > >> > > > > > > > > time-based retention
> > > > >> > > > > > > > > 3) enable both log compaction and time based
> > > retention:
> > > > >> > > Deletion
> > > > >> > > > > is not
> > > > >> > > > > > > > > guaranteed.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Not sure if we have use case 3 and also want
> deletion
> > > to
> > > > >> > happen
> > > > >> > > > on
> > > > >> > > > > > > time.
> > > > >> > > > > > > > > There are several options to address deletion
> issue
> > > when
> > > > >> > enable
> > > > >> > > > > both
> > > > >> > > > > > > > > compaction and retention:
> > > > >> > > > > > > > > A) During log compaction, looking into record
> > > timestamp
> > > > to
> > > > >> > > delete
> > > > >> > > > > > > expired
> > > > >> > > > > > > > > records. This can be done in compaction logic
> itself
> > > or
> > > > >> use
> > > > >> > > > > > > > > AdminClient.deleteRecords() . But this assumes we
> have
> > > > >> record
> > > > >> > > > > > > timestamp.
> > > > >> > > > > > > > > B) retain the lastModifed time of original
> segments
> > > > during
> > > > >> > log
> > > > >> > > > > > > > compaction.
> > > > >> > > > > > > > > This requires extra meta data to record the
> > > information
> > > > or
> > > > >> > not
> > > > >> > > > > grouping
> > > > >> > > > > > > > > multiple segments into one during compaction.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > If we have use case 3 in general, I would prefer
> > > > solution
> > > > >> A
> > > > >> > and
> > > > >> > > > > rely on
> > > > >> > > > > > > > > record timestamp.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Two questions:
> > > > >> > > > > > > > > Do we have use case 3? Is it nice to have or must
> > > have?
> > > > >> > > > > > > > > If we have use case 3 and want to go with
> solution A,
> > > > >> should
> > > > >> > we
> > > > >> > > > > > > introduce
> > > > >> > > > > > > > > a new configuration to enforce deletion by
> timestamp?
> > > > >> > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > On Tue, Aug 14, 2018 at 1:52 PM, xiongqi wu <
> > > > >> > > xiongqiwu@gmail.com
> > > > >> > > > >
> > > > >> > > > > > > wrote:
> > > > >> > > > > > > > >
> > > > >> > > > > > > > >> Dong,
> > > > >> > > > > > > > >>
> > > > >> > > > > > > > >> Thanks for the comment.
> > > > >> > > > > > > > >>
> > > > >> > > > > > > > >> There are two retention policies: log compaction and time based
> > > > >> > > > > > > > >> retention.
> > > > >> > > > > > > > >>
> > > > >> > > > > > > > >> Log compaction:
> > > > >> > > > > > > > >>
> > > > >> > > > > > > > >> We have use cases to keep infinite retention of a topic (only
> > > > >> > > > > > > > >> compaction). GDPR cares about deletion of PII (personally
> > > > >> > > > > > > > >> identifiable information) data.
> > > > >> > > > > > > > >> Since Kafka doesn't know which records contain PII, it relies on the
> > > > >> > > > > > > > >> upper layer to delete those records.
> > > > >> > > > > > > > >> For those infinite retention use cases, kafka needs to provide a way
> > > > >> > > > > > > > >> to enforce compaction on time. This is what we try to address in
> > > > >> > > > > > > > >> this KIP.
> > > > >> > > > > > > > >>
> > > > >> > > > > > > > >> Time based retention:
> > > > >> > > > > > > > >>
> > > > >> > > > > > > > >> There are also use cases where users of Kafka might want to expire
> > > > >> > > > > > > > >> all their data.
> > > > >> > > > > > > > >> In those cases, they can use time based retention for their topics.
> > > > >> > > > > > > > >>
> > > > >> > > > > > > > >> Regarding your first question: if a user wants to delete a key in a
> > > > >> > > > > > > > >> log compacted topic, the user has to send a deletion using the same
> > > > >> > > > > > > > >> key. Kafka only makes sure the deletion will happen within a certain
> > > > >> > > > > > > > >> time period (like 2 days/7 days).
> > > > >> > > > > > > > >>
> > > > >> > > > > > > > >> Regarding your second question: in most cases, we might want to
> > > > >> > > > > > > > >> delete all duplicated keys at the same time.
> > > > >> > > > > > > > >> Compaction might be more efficient since we need to scan the log and
> > > > >> > > > > > > > >> find all duplicates. However, the expected use case is to set the
> > > > >> > > > > > > > >> time based compaction interval on the order of days, and larger than
> > > > >> > > > > > > > >> "min compaction lag". We don't want log compaction to happen
> > > > >> > > > > > > > >> frequently since it is expensive. The purpose is to help
> > > > >> > > > > > > > >> low-production-rate topics get compacted on time. For topics with a
> > > > >> > > > > > > > >> "normal" incoming message rate, the "min dirty ratio" might have
> > > > >> > > > > > > > >> triggered the compaction before this time based compaction policy
> > > > >> > > > > > > > >> takes effect.
> > > > >> > > > > > > > >>
> > > > >> > > > > > > > >> Eno,
> > > > >> > > > > > > > >>
> > > > >> > > > > > > > >> For your question: like I mentioned, we have long-time-retention use
> > > > >> > > > > > > > >> cases for log compacted topics, but we want to provide the ability
> > > > >> > > > > > > > >> to delete certain PII records on time.
> > > > >> > > > > > > > >> Kafka itself doesn't know whether a record contains sensitive
> > > > >> > > > > > > > >> information and relies on the user for deletion.
> > > > >> > > > > > > > >>
> > > > >> > > > > > > > >> On Mon, Aug 13, 2018 at 6:58 PM, Dong Lin <lindong28@gmail.com>
> > > > >> > > > > > > > >> wrote:
> > > > >> > > > > > > > >>
> > > > >> > > > > > > > >>> Hey Xiongqi,
> > > > >> > > > > > > > >>>
> > > > >> > > > > > > > >>> Thanks for the KIP. I have two questions regarding the use-case for
> > > > >> > > > > > > > >>> meeting the GDPR requirement.
> > > > >> > > > > > > > >>>
> > > > >> > > > > > > > >>> 1) If I recall correctly, one of the GDPR requirements is that we can
> > > > >> > > > > > > > >>> not keep messages longer than e.g. 30 days in storage (e.g. Kafka).
> > > > >> > > > > > > > >>> Say there exists a partition p0 which contains message1 with key1 and
> > > > >> > > > > > > > >>> message2 with key2. And then the user keeps producing messages with
> > > > >> > > > > > > > >>> key=key2 to this partition. Since message1 with key1 is never
> > > > >> > > > > > > > >>> overridden, sooner or later we will want to delete message1 and keep
> > > > >> > > > > > > > >>> the latest message with key=key2. But currently it looks like the log
> > > > >> > > > > > > > >>> compaction logic in Kafka will always put these messages in the same
> > > > >> > > > > > > > >>> segment. Will this be an issue?
> > > > >> > > > > > > > >>>
> > > > >> > > > > > > > >>> 2) The current KIP intends to provide the capability to delete a
> > > > >> > > > > > > > >>> given message in a log compacted topic. Does such a use-case also
> > > > >> > > > > > > > >>> require Kafka to keep the messages produced before the given message?
> > > > >> > > > > > > > >>> If yes, then we can probably just use AdminClient.deleteRecords() or
> > > > >> > > > > > > > >>> time-based log retention to meet the use-case requirement. If no, do
> > > > >> > > > > > > > >>> you know what the GDPR's requirement is on time-to-deletion after a
> > > > >> > > > > > > > >>> user explicitly requests the deletion (e.g. 1 hour, 1 day, 7 days)?
> > > > >> > > > > > > > >>>
> > > > >> > > > > > > > >>> Thanks,
> > > > >> > > > > > > > >>> Dong
> > > > >> > > > > > > > >>>
> > > > >> > > > > > > > >>> On Mon, Aug 13, 2018 at 3:44 PM, xiongqi wu <xiongqiwu@gmail.com>
> > > > >> > > > > > > > >>> wrote:
> > > > >> > > > > > > > >>>
> > > > >> > > > > > > > >>> > Hi Eno,
> > > > >> > > > > > > > >>> >
> > > > >> > > > > > > > >>> > The GDPR request we are getting here at LinkedIn is: if we get a
> > > > >> > > > > > > > >>> > request to delete a record through a null key on a log compacted
> > > > >> > > > > > > > >>> > topic, we want to delete the record via compaction in a given time
> > > > >> > > > > > > > >>> > period, like 2 days (whatever is required by the policy).
> > > > >> > > > > > > > >>> >
> > > > >> > > > > > > > >>> > There might be other issues (such as orphan log segments under
> > > > >> > > > > > > > >>> > certain conditions) that lead to GDPR problems, but they are more
> > > > >> > > > > > > > >>> > like something we need to fix anyway regardless of GDPR.
> > > > >> > > > > > > > >>> >
> > > > >> > > > > > > > >>> > -- Xiongqi (Wesley) Wu
> > > > >> > > > > > > > >>> >
> > > > >> > > > > > > > >>> > On Mon, Aug 13, 2018 at 2:56 PM, Eno Thereska <eno.thereska@gmail.com>
> > > > >> > > > > > > > >>> > wrote:
> > > > >> > > > > > > > >>> >
> > > > >> > > > > > > > >>> > > Hello,
> > > > >> > > > > > > > >>> > >
> > > > >> > > > > > > > >>> > > Thanks for the KIP. I'd like to see a more precise definition of what
> > > > >> > > > > > > > >>> > > part of GDPR you are targeting, as well as some sort of verification
> > > > >> > > > > > > > >>> > > that this KIP actually addresses the problem. Right now I find this a
> > > > >> > > > > > > > >>> > > bit vague:
> > > > >> > > > > > > > >>> > >
> > > > >> > > > > > > > >>> > > "Ability to delete a log message through compaction in a timely
> > > > >> > > > > > > > >>> > > manner has become an important requirement in some use cases (e.g.,
> > > > >> > > > > > > > >>> > > GDPR)"
> > > > >> > > > > > > > >>> > >
> > > > >> > > > > > > > >>> > > Is there any guarantee that after this KIP the GDPR problem is
> > > > >> > > > > > > > >>> > > solved, or do we need to do something else as well, e.g., more KIPs?
> > > > >> > > > > > > > >>> > >
> > > > >> > > > > > > > >>> > > Thanks
> > > > >> > > > > > > > >>> > >
> > > > >> > > > > > > > >>> > > Eno
> > > > >> > > > > > > > >>> > >
> > > > >> > > > > > > > >>> > > On Thu, Aug 9, 2018 at 4:18 PM, xiongqi wu <xiongqiwu@gmail.com> wrote:
> > > > >> > > > > > > > >>> > >
> > > > >> > > > > > > > >>> > > > Hi Kafka,
> > > > >> > > > > > > > >>> > > >
> > > > >> > > > > > > > >>> > > > This KIP tries to address the GDPR concern to fulfill deletion
> > > > >> > > > > > > > >>> > > > requests on time through time-based log compaction on a
> > > > >> > > > > > > > >>> > > > compaction-enabled topic:
> > > > >> > > > > > > > >>> > > >
> > > > >> > > > > > > > >>> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-354%3A+Time-based+log+compaction+policy
> > > > >> > > > > > > > >>> > > >
> > > > >> > > > > > > > >>> > > > Any feedback will be appreciated.
> > > > >> > > > > > > > >>> > > >
> > > > >> > > > > > > > >>> > > > Xiongqi (Wesley) Wu
> > > > >> > > > > > > > >>> > > >
> > > > >> > > > > > > > >>> > >
> > > > >> > > > > > > > >>> >
> > > > >> > > > > > > > >>>
> > > > >> > > > > > > > >>
> > > > >> > > > > > > > >>
> > > > >> > > > > > > > >>
> > > > >> > > > > > > > >> --
> > > > >> > > > > > > > >> Xiongqi (Wesley) Wu
> > > > >> > > > > > > > >>
> > > > >> > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > --
> > > > >> > > > > > > > > Xiongqi (Wesley) Wu
> > > > >> > > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > > > --
> > > > >> > > > > > > > Xiongqi (Wesley) Wu
> > > > >> > > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > > > --
> > > > >> > > > > > >
> > > > >> > > > > > > Brett Rann
> > > > >> > > > > > >
> > > > >> > > > > > > Senior DevOps Engineer
> > > > >> > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > > > Zendesk International Ltd
> > > > >> > > > > > >
> > > > >> > > > > > > 395 Collins Street, Melbourne VIC 3000 Australia
> > > > >> > > > > > >
> > > > >> > > > > > > Mobile: +61 (0) 418 826 017
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > > >
> > > > >> > > > > >
> > > > >> > > > > > --
> > > > >> > > > > > Xiongqi (Wesley) Wu
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > > >
> > > > >> > > >
> > > > >> > > > --
> > > > >> > > > Xiongqi (Wesley) Wu
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > --
> > > > >> > Xiongqi (Wesley) Wu
> > > > >> >
> > > > >>
> > > > >>
> > > > >>
> > > > >
> > > >
> > > >
> > > > --
> > > > Xiongqi (Wesley) Wu
> > > >
> > >
> > >
> > >
>


-- 
-Regards,
Mayuresh R. Gharat
(862) 250-7125

Re: [DISCUSS] KIP-354 Time-based log compaction policy

Posted by Brett Rann <br...@zendesk.com.INVALID>.
Might also be worth moving to a vote thread? Discussion seems to have gone as far as it can. 

> On 4 Sep 2018, at 12:08, xiongqi wu <xi...@gmail.com> wrote:
> 
> Brett,
> 
> Yes, I will post PR tomorrow.
> 
> Xiongqi (Wesley) Wu
> 
> 
> On Sun, Sep 2, 2018 at 6:28 PM Brett Rann <br...@zendesk.com.invalid> wrote:
> 
> > +1 (non-binding) from me on the interface. I'd like to see someone familiar
> > with
> > the code comment on the approach, and note there's a couple of different
> > approaches: what's documented in the KIP, and what Xiaohe Dong was working
> > on
> > here:
> >
> > https://github.com/dongxiaohe/kafka/tree/dongxiaohe/log-cleaner-compaction-max-lifetime-2.0
> >
> > If you have code working already Xiongqi Wu could you share a PR? I'd be
> > happy
> > to start testing.
> >
> > On Tue, Aug 28, 2018 at 5:57 AM xiongqi wu <xi...@gmail.com> wrote:
> >
> > > Hi All,
> > >
> > > Do you have any additional comments on this KIP?
> > >
> > >
> > > On Thu, Aug 16, 2018 at 9:17 PM, xiongqi wu <xi...@gmail.com> wrote:
> > >
> > > > on 2)
> > > > The offsetmap is built starting from dirty segment.
> > > > The compaction starts from the beginning of the log partition. That's
> > > > how it ensures the deletion of tombstone keys.
> > > > I will double check tomorrow.
> > > >
> > > > Xiongqi (Wesley) Wu
> > > >
> > > >
> > > > On Thu, Aug 16, 2018 at 6:46 PM Brett Rann <br...@zendesk.com.invalid>
> > > > wrote:
> > > >
> > > >> To just clarify a bit on 1: whether there's an external storage/DB
> > > >> isn't relevant here.
> > > >> Compacted topics allow a tombstone record to be sent (a null value for
> > > >> a key) which currently will result in old values for that key being
> > > >> deleted if some conditions are met.
> > > >> There are existing controls to make sure the old values will stay
> > > >> around for a minimum time at least, but no dedicated control to ensure
> > > >> the tombstone will delete within a maximum time.
> > > >>
> > > >> One popular reason that a maximum time for deletion is desirable right
> > > >> now is GDPR with PII. But we're not proposing any GDPR awareness in
> > > >> kafka, just being able to guarantee a max time where a tombstoned key
> > > >> will be removed from the compacted topic.
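The tombstone semantics discussed above can be sketched with a toy model. This is illustrative Python only, not Kafka's cleaner code: real compaction works segment-by-segment and honors the lag and delete-retention configs.

```python
def compact(log):
    """Toy model of log compaction: keep only the latest value per key,
    and drop a key entirely once its latest value is a tombstone (None).
    Real Kafka compaction also retains tombstones for delete.retention.ms
    before purging them."""
    latest = {}
    for key, value in log:
        latest[key] = value  # a later record shadows earlier ones
    return {k: v for k, v in latest.items() if v is not None}

# "user2" is tombstoned, so it disappears once compaction runs.
log = [("user1", "a"), ("user2", "b"), ("user1", "c"), ("user2", None)]
print(compact(log))  # {'user1': 'c'}
```

The KIP is about bounding *when* this final state is reached, not about changing what it looks like.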
> > > >>
> > > >> on 2)
> > > >> huh, i thought it kept track of the first dirty segment and didn't
> > > >> recompact older "clean" ones.
> > > >> But I didn't look at code or test for that.
> > > >>
> > > >> On Fri, Aug 17, 2018 at 10:57 AM xiongqi wu <xi...@gmail.com>
> > > wrote:
> > > >>
> > > >> > 1, Owner of data (in this sense, kafka is not the owner of data)
> > > >> > should keep track of lifecycle of the data in some external
> > > storage/DB.
> > > >> > The owner determines when to delete the data and send the delete
> > > >> request to
> > > >> > kafka. Kafka doesn't know about the content of the data but provides
> > > >> > a means for deletion.
> > > >> >
> > > >> > 2, each time compaction runs, it will start from the first segment (no
> > > >> > matter if it is compacted or not). The time estimation here is only used
> > > >> > to determine whether we should run compaction on this log partition. So
> > > >> > we only need to estimate uncompacted segments.
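The two-part mechanism described here (offset map built from the dirty section, cleaning pass starting at the beginning of the log) can be sketched as a toy model; `first_dirty_offset` and the list-of-pairs log are simplifications, not the real cleaner's data structures:

```python
def clean(log, first_dirty_offset):
    """Toy model of one cleaner pass. The offset map is built only from
    the dirty section, but the scan starts at offset 0, which is why a
    tombstone in the dirty section can remove matching keys from
    segments that were already compacted earlier."""
    offset_map = {}  # key -> latest offset seen in the dirty section
    for offset, (key, _value) in enumerate(log):
        if offset >= first_dirty_offset:
            offset_map[key] = offset
    cleaned = []
    for offset, (key, value) in enumerate(log):
        if offset < offset_map.get(key, -1):
            continue  # shadowed by a later record in the dirty section
        cleaned.append((key, value))
    return cleaned

# Offsets 0-1 were cleaned earlier; offset 2 (a tombstone for k1) is dirty.
log = [("k1", "v1"), ("k2", "v2"), ("k1", None)]
print(clean(log, first_dirty_offset=2))  # [('k2', 'v2'), ('k1', None)]
```

Note the tombstone itself survives this pass, matching Kafka's behavior of retaining tombstones for a while so consumers can observe the delete.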
> > > >> >
> > > >> > On Thu, Aug 16, 2018 at 5:35 PM, Dong Lin <li...@gmail.com>
> > > wrote:
> > > >> >
> > > >> > > Hey Xiongqi,
> > > >> > >
> > > >> > > Thanks for the update. I have two questions for the latest KIP.
> > > >> > >
> > > >> > > 1) The motivation section says that one use case is to delete PII
> > > >> > (Personal
> > > >> > > Identifiable information) data within 7 days while keeping non-PII
> > > >> > > indefinitely in compacted format. I suppose the use-case depends
> > on
> > > >> the
> > > >> > > application to determine when to delete those PII data. Could you
> > > >> explain
> > > >> > > how can application reliably determine the set of keys that should
> > > be
> > > >> > > deleted? Is the application required to always read messages from the topic
> > > >> after
> > > >> > > every restart and determine the keys to be deleted by looking at
> > > >> message
> > > >> > > timestamp, or is application supposed to persist the key->
> > timestamp
> > > >> > > information in a separate persistent storage system?
> > > >> > >
> > > >> > > 2) It is mentioned in the KIP that "we only need to estimate
> > > earliest
> > > >> > > message timestamp for un-compacted log segments because the
> > deletion
> > > >> > > requests that belong to compacted segments have already been
> > > >> processed".
> > > >> > > Not sure if it is correct. If a segment is compacted before user
> > > sends
> > > >> > > message to delete a key in this segment, it seems that we still
> > need
> > > >> to
> > > >> > > ensure that the segment will be compacted again within the given
> > > time
> > > >> > after
> > > >> > > the deletion is requested, right?
> > > >> > >
> > > >> > > Thanks,
> > > >> > > Dong
> > > >> > >
> > > >> > > On Thu, Aug 16, 2018 at 10:27 AM, xiongqi wu <xiongqiwu@gmail.com>
> > > >> > > wrote:
> > > >> > >
> > > >> > > > Hi Xiaohe,
> > > >> > > >
> > > >> > > > Quick note:
> > > >> > > > 1) Use minimum of segment.ms and max.compaction.lag.ms
> > > >> > > >
> > > >> > > > 2) I am not sure if I get your second question. First, we have jitter
> > > >> > > > when we roll the active segment. Second, on each compaction, we
> > > >> > > > compact up to what the offsetmap allows. Those will not lead to a
> > > >> > > > perfect compaction storm over time. In addition, I expect we are
> > > >> > > > setting max.compaction.lag.ms on the order of days.
> > > >> > > >
> > > >> > > > 3) I don't have access to the confluent community slack for now. I am
> > > >> > > > reachable via the google handle out.
> > > >> > > > To avoid double effort, here is my plan:
> > > >> > > > a) Collect more feedback and feature requirements on the KIP.
> > > >> > > > b) Wait until this KIP is approved.
> > > >> > > > c) I will address any additional requirements in the implementation.
> > > >> > > > (My current implementation only complies with whatever is described
> > > >> > > > in the KIP now.)
> > > >> > > > d) I can share the code with you and the community to see if you want
> > > >> > > > to add anything.
> > > >> > > > e) Submission through committee
> > > >> > > >
> > > >> > > >
> > > >> > > > On Wed, Aug 15, 2018 at 11:42 PM, XIAOHE DONG <dannyrivclo@gmail.com>
> > > >> > > > wrote:
> > > >> > > >
> > > >> > > > > Hi Xiongqi
> > > >> > > > >
> > > >> > > > > Thanks for thinking about implementing this as well. :)
> > > >> > > > >
> > > >> > > > > I was thinking about using `segment.ms` to trigger the segment roll.
> > > >> > > > > Also, its value can be the largest time bias for the record deletion.
> > > >> > > > > For example, if the `segment.ms` is 1 day and `max.compaction.ms` is
> > > >> > > > > 30 days, the compaction may happen around 31 days.
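Written out, the 1-day/30-day example above is just the sum of the two bounds; this is back-of-envelope arithmetic for the worst case, not the KIP's formal wording:

```python
DAY_MS = 24 * 60 * 60 * 1000

def worst_case_delete_delay_ms(segment_ms, max_compaction_lag_ms):
    """A record appended right after a roll can sit in the active
    segment for up to segment_ms before the segment closes, and the
    closed segment then has up to max_compaction_lag_ms before it must
    be compacted."""
    return segment_ms + max_compaction_lag_ms

delay = worst_case_delete_delay_ms(1 * DAY_MS, 30 * DAY_MS)
print(delay // DAY_MS)  # 31
```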
> > > >> > > > >
> > > >> > > > > For my curiosity, is there a way we can do some performance test for
> > > >> > > > > this, and any tools you can recommend? As you know, previously it is
> > > >> > > > > cleaned up by respecting the dirty ratio, but now it may happen
> > > >> > > > > anytime if the max lag has passed for each message. I wonder what
> > > >> > > > > would happen if clients send a huge amount of tombstone records at
> > > >> > > > > the same time.
> > > >> > > > >
> > > >> > > > > I am looking forward to having a quick chat with you to avoid double
> > > >> > > > > effort on this. I am in the confluent community slack during work
> > > >> > > > > time. My name is Xiaohe Dong. :)
> > > >> > > > >
> > > >> > > > > Rgds
> > > >> > > > > Xiaohe Dong
> > > >> > > > >
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > On 2018/08/16 01:22:22, xiongqi wu <xi...@gmail.com>
> > wrote:
> > > >> > > > > > Brett,
> > > >> > > > > >
> > > >> > > > > > Thank you for your comments.
> > > >> > > > > > I was thinking that since we already have an immediate-compaction
> > > >> > > > > > setting (setting the min dirty ratio to 0), I decided to use "0" as
> > > >> > > > > > the disabled state.
> > > >> > > > > > I am ok to go with -1(disable), 0 (immediate) options.
> > > >> > > > > >
> > > >> > > > > > For the implementation, there are a few differences between mine and
> > > >> > > > > > "Xiaohe Dong"'s:
> > > >> > > > > > 1) I used the estimated creation time of a log segment instead of the
> > > >> > > > > > largest timestamp of a log to determine the compaction eligibility,
> > > >> > > > > > because a log segment might stay as an active segment up to "max
> > > >> > > > > > compaction lag" (see the KIP for detail).
> > > >> > > > > > 2) I measure how much bytes that we must clean to follow the
> > > >> "max
> > > >> > > > > > compaction lag" rule, and use that to determine the order of
> > > >> > > > compaction.
> > > >> > > > > > 3) force active segment to roll to follow the "max
> > compaction
> > > >> lag"
> > > >> > > > > >
> > > >> > > > > > I can share my code so we can coordinate.
> > > >> > > > > >
> > > >> > > > > > I haven't thought about a new API to force a compaction. What is
> > > >> > > > > > the use case for this one?
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > On Wed, Aug 15, 2018 at 5:33 PM, Brett Rann <brann@zendesk.com.invalid>
> > > >> > > > > > wrote:
> > > >> > > > > >
> > > >> > > > > > > We've been looking into this too.
> > > >> > > > > > >
> > > >> > > > > > > Mailing list:
> > > >> > > > > > > https://lists.apache.org/thread.html/ed7f6a6589f94e8c2a705553f364ef599cb6915e4c3ba9b561e610e4@%3Cdev.kafka.apache.org%3E
> > > >> > > > > > > jira wish: https://issues.apache.org/jira/browse/KAFKA-7137
> > > >> > > > > > > confluent slack discussion:
> > > >> > > > > > > https://confluentcommunity.slack.com/archives/C49R61XMM/p1530760121000039
> > > >> > > > > > >
> > > >> > > > > > > A person on my team has started on code so you might want to
> > > >> > > > > > > coordinate:
> > > >> > > > > > > https://github.com/dongxiaohe/kafka/tree/dongxiaohe/log-cleaner-compaction-max-lifetime-2.0
> > > >> > > > > > >
> > > >> > > > > > > He's been working with Jason Gustafson and James Chen around the
> > > >> > > > > > > changes.
> > > >> > > > > > > You can ping him on confluent slack as Xiaohe Dong.
> > > >> > > > > > >
> > > >> > > > > > > It's great to know others are thinking on it as well.
> > > >> > > > > > >
> > > >> > > > > > > You've added the requirement to force a segment roll which we hadn't
> > > >> > > > > > > gotten to yet, which is great. I was content with it not including
> > > >> > > > > > > the active segment.
> > > >> > > > > > >
> > > >> > > > > > > > Adding topic level configuration "max.compaction.lag.ms
> > ",
> > > >> and
> > > >> > > > > > > corresponding broker configuration "
> > > >> > log.cleaner.max.compaction.la
> > > >> > > > g.ms
> > > >> > > > > ",
> > > >> > > > > > > which is set to 0 (disabled) by default.
> > > >> > > > > > >
> > > >> > > > > > > Glancing at some other settings, convention seems to me to be -1
> > > >> > > > > > > for disabled (or infinite, which is more meaningful here). 0 to me
> > > >> > > > > > > implies instant, a little quicker than 1.
> > > >> > > > > > >
> > > >> > > > > > > We've been trying to think about a way to trigger compaction as
> > > >> > > > > > > well through an API call, which would need to be flagged somewhere
> > > >> > > > > > > (ZK admin/space?) but we're struggling to think how that would be
> > > >> > > > > > > coordinated across brokers and partitions. Have you given any
> > > >> > > > > > > thought to that?
> > > >> > > > > > >
> > > >> > > > > > >
> > > >> > > > > > >
> > > >> > > > > > >
> > > >> > > > > > >
> > > >> > > > > > >
> > > >> > > > > > > On Thu, Aug 16, 2018 at 8:44 AM xiongqi wu <xiongqiwu@gmail.com>
> > > >> > > > > > > wrote:
> > > >> > > > > > >
> > > >> > > > > > > > Eno, Dong,
> > > >> > > > > > > >
> > > >> > > > > > > > I have updated the KIP. We decided not to address the issue that
> > > >> > > > > > > > we might have for topics with both compaction and time retention
> > > >> > > > > > > > enabled (see the rejected alternative item 2). This KIP will only
> > > >> > > > > > > > ensure a log can be compacted after a specified time-interval.
> > > >> > > > > > > >
> > > >> > > > > > > > As suggested by Dong, we will also enforce that
> > > >> > > > > > > > "max.compaction.lag.ms" is not less than "min.compaction.lag.ms".
> > > >> > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-354
> > > >> > > > > > > > Time-based log compaction policy
> > > >> > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > > On Tue, Aug 14, 2018 at 5:01 PM, xiongqi wu <xiongqiwu@gmail.com>
> > > >> > > > > > > > wrote:
> > > >> > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > > > Per discussion with Dong, he made a very good point
> > that
> > > >> if
> > > >> > > > > compaction
> > > >> > > > > > > > > and time based retention are both enabled on a topic,
> > > the
> > > >> > > > > compaction
> > > >> > > > > > > > might
> > > >> > > > > > > > > prevent records from being deleted on time. The reason
> > > is
> > > >> > when
> > > >> > > > > > > compacting
> > > >> > > > > > > > > multiple segments into one single segment, the newly
> > > >> created
> > > >> > > > > segment
> > > >> > > > > > > will
> > > >> > > > > > > > > have same lastmodified timestamp as latest original
> > > >> segment.
> > > >> > We
> > > >> > > > > lose
> > > >> > > > > > > the
> > > >> > > > > > > > > timestamp of all original segments except the last
> > one.
> > > >> As a
> > > >> > > > > result,
> > > >> > > > > > > > > records might not be deleted as they should be through
> > > >> > > > > > > > > time-based retention.
> > > >> > > > > > > > >
> > > >> > > > > > > > > With the current KIP proposal, if we want to ensure
> > > timely
> > > >> > > > > deletion, we
> > > >> > > > > > > > > have the following configurations:
> > > >> > > > > > > > > 1) enable time-based log compaction only: deletion is done
> > > >> > > > > > > > > through overriding the same key
> > > >> > > > > > > > > 2) enable time-based log retention only: deletion is done
> > > >> > > > > > > > > through time-based retention
> > > >> > > > > > > > > 3) enable both log compaction and time based
> > retention:
> > > >> > > Deletion
> > > >> > > > > is not
> > > >> > > > > > > > > guaranteed.
> > > >> > > > > > > > >
> > > >> > > > > > > > > Not sure if we have use case 3 and also want deletion
> > to
> > > >> > happen
> > > >> > > > on
> > > >> > > > > > > time.
> > > >> > > > > > > > > There are several options to address deletion issue
> > when
> > > >> > enable
> > > >> > > > > both
> > > >> > > > > > > > > compaction and retention:
> > > >> > > > > > > > > A) During log compaction, looking into record
> > timestamp
> > > to
> > > >> > > delete
> > > >> > > > > > > expired
> > > >> > > > > > > > > records. This can be done in compaction logic itself
> > or
> > > >> use
> > > >> > > > > > > > > AdminClient.deleteRecords() . But this assumes we have
> > > >> record
> > > >> > > > > > > timestamp.
> > > >> > > > > > > > > B) retain the lastModified time of original segments
> > > during
> > > >> > log
> > > >> > > > > > > > compaction.
> > > >> > > > > > > > > This requires extra meta data to record the
> > information
> > > or
> > > >> > not
> > > >> > > > > grouping
> > > >> > > > > > > > > multiple segments into one during compaction.
> > > >> > > > > > > > >
> > > >> > > > > > > > > If we have use case 3 in general, I would prefer
> > > solution
> > > >> A
> > > >> > and
> > > >> > > > > rely on
> > > >> > > > > > > > > record timestamp.
> > > >> > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > > > Two questions:
> > > >> > > > > > > > > Do we have use case 3? Is it nice to have or must
> > have?
> > > >> > > > > > > > > If we have use case 3 and want to go with solution A,
> > > >> should
> > > >> > we
> > > >> > > > > > > introduce
> > > >> > > > > > > > > a new configuration to enforce deletion by timestamp?
> > > >> > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > > > On Tue, Aug 14, 2018 at 1:52 PM, xiongqi wu <
> > > >> > > xiongqiwu@gmail.com
> > > >> > > > >
> > > >> > > > > > > wrote:
> > > >> > > > > > > > >
> > > >> > > > > > > > >> Dong,
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> Thanks for the comment.
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> There are two retention policy: log compaction and
> > time
> > > >> > based
> > > >> > > > > > > retention.
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> Log compaction:
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> we have use cases to keep infinite retention of a
> > topic
> > > >> > (only
> > > >> > > > > > > > >> compaction). GDPR cares about deletion of PII
> > (personal
> > > >> > > > > identifiable
> > > >> > > > > > > > >> information) data.
> > > >> > > > > > > > >> Since Kafka doesn't know what records contain PII, it
> > > >> relies
> > > >> > > on
> > > >> > > > > upper
> > > >> > > > > > > > >> layer to delete those records.
> > > >> > > > > > > > >> For those infinite retention use cases, Kafka needs to
> > > >> > > > > > > > >> provide a way to
> > > >> > > > > > > > >> enforce compaction on time. This is what we try to
> > > >> address
> > > >> > in
> > > >> > > > this
> > > >> > > > > > > KIP.
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> Time based retention,
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> There are also use cases that users of Kafka might
> > want
> > > >> to
> > > >> > > > expire
> > > >> > > > > all
> > > >> > > > > > > > >> their data.
> > > >> > > > > > > > >> In those cases, they can use time based retention of
> > > >> their
> > > >> > > > topics.
> > > >> > > > > > > > >>
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> Regarding your first question, if a user wants to
> > > delete
> > > >> a
> > > >> > key
> > > >> > > > in
> > > >> > > > > the
> > > >> > > > > > > > >> log compaction topic, the user has to send a deletion
> > > >> using
> > > >> > > the
> > > >> > > > > same
> > > >> > > > > > > > key.
> > > >> > > > > > > > >> Kafka only makes sure the deletion will happen within a
> > > >> > > > > > > > >> certain time period (like 2 days/7 days).
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> Regarding your second question. In most cases, we
> > might
> > > >> want
> > > >> > > to
> > > >> > > > > delete
> > > >> > > > > > > > >> all duplicated keys at the same time.
> > > >> > > > > > > > >> Compaction might be more efficient since we need to
> > > scan
> > > >> the
> > > >> > > log
> > > >> > > > > and
> > > >> > > > > > > > find
> > > >> > > > > > > > >> all duplicates. However, the expected use case is to
> > > set
> > > >> the
> > > >> > > > time
> > > >> > > > > > > based
> > > >> > > > > > > > >> compaction interval on the order of days, and be
> > larger
> > > >> than
> > > >> > > > 'min
> > > >> > > > > > > > >> compaction lag". We don't want log compaction to
> > happen
> > > >> > > > frequently
> > > >> > > > > > > since
> > > >> > > > > > > > >> it is expensive. The purpose is to help low
> > production
> > > >> rate
> > > >> > > > topic
> > > >> > > > > to
> > > >> > > > > > > get
> > > >> > > > > > > > >> compacted on time. For a topic with a "normal" incoming
> > > >> > > > > > > > >> message rate, the "min dirty ratio" might have triggered the
> > > >> > > compaction
> > > >> > > > > before
> > > >> > > > > > > > this
> > > >> > > > > > > > >> time based compaction policy takes effect.
> > > >> > > > > > > > >>
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> Eno,
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> For your question, like I mentioned we have long time
> > > >> > > retention
> > > >> > > > > use
> > > >> > > > > > > case
> > > >> > > > > > > > >> for log compacted topic, but we want to provide
> > ability
> > > >> to
> > > >> > > > delete
> > > >> > > > > > > > certain
> > > >> > > > > > > > >> PII records on time.
> > > >> > > > > > > > >> Kafka itself doesn't know whether a record contains
> > > >> > sensitive
> > > >> > > > > > > > information
> > > >> > > > > > > > >> and relies on the user for deletion.
> > > >> > > > > > > > >>
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> On Mon, Aug 13, 2018 at 6:58 PM, Dong Lin <
> > > >> > > lindong28@gmail.com>
> > > >> > > > > > > wrote:
> > > >> > > > > > > > >>
> > > >> > > > > > > > >>> Hey Xiongqi,
> > > >> > > > > > > > >>>
> > > >> > > > > > > > >>> Thanks for the KIP. I have two questions regarding
> > the
> > > >> > > use-case
> > > >> > > > > for
> > > >> > > > > > > > >>> meeting
> > > >> > > > > > > > >>> GDPR requirement.
> > > >> > > > > > > > >>>
> > > >> > > > > > > > >>> 1) If I recall correctly, one of the GDPR
> > requirement
> > > is
> > > >> > that
> > > >> > > > we
> > > >> > > > > can
> > > >> > > > > > > > not
> > > >> > > > > > > > >>> keep messages longer than e.g. 30 days in storage
> > > (e.g.
> > > >> > > Kafka).
> > > >> > > > > Say
> > > >> > > > > > > > there
> > > >> > > > > > > > >>> exists a partition p0 which contains message1 with
> > > key1
> > > >> and
> > > >> > > > > message2
> > > >> > > > > > > > with
> > > >> > > > > > > > >>> key2. And then user keeps producing messages with
> > > >> key=key2
> > > >> > to
> > > >> > > > > this
> > > >> > > > > > > > >>> partition. Since message1 with key1 is never
> > > overridden,
> > > >> > > sooner
> > > >> > > > > or
> > > >> > > > > > > > later
> > > >> > > > > > > > >>> we
> > > >> > > > > > > > >>> will want to delete message1 and keep the latest
> > > message
> > > >> > with
> > > >> > > > > > > key=key2.
> > > >> > > > > > > > >>> But
> > > >> > > > > > > > >>> currently it looks like log compact logic in Kafka
> > > will
> > > >> > > always
> > > >> > > > > put
> > > >> > > > > > > > these
> > > >> > > > > > > > >>> messages in the same segment. Will this be an issue?
> > > >> > > > > > > > >>>
> > > >> > > > > > > > >>> 2) The current KIP intends to provide the capability
> > > to
> > > >> > > delete
> > > >> > > > a
> > > >> > > > > > > given
> > > >> > > > > > > > >>> message in log compacted topic. Does such use-case
> > > also
> > > >> > > require
> > > >> > > > > Kafka
> > > >> > > > > > > > to
> > > >> > > > > > > > >>> keep the messages produced before the given message?
> > > If
> > > >> > yes,
> > > >> > > > > then we
> > > >> > > > > > > > can
> > > >> > > > > > > > >>> probably just use AdminClient.deleteRecords() or
> > > >> time-based
> > > >> > > log
> > > >> > > > > > > > retention
> > > >> > > > > > > > >>> to meet the use-case requirement. If no, do you know
> > > >> what
> > > >> > is
> > > >> > > > the
> > > >> > > > > > > GDPR's
> > > >> > > > > > > > >>> requirement on time-to-deletion after user
> > explicitly
> > > >> > > requests
> > > >> > > > > the
> > > >> > > > > > > > >>> deletion
> > > >> > > > > > > > >>> (e.g. 1 hour, 1 day, 7 day)?
> > > >> > > > > > > > >>>
> > > >> > > > > > > > >>> Thanks,
> > > >> > > > > > > > >>> Dong
> > > >> > > > > > > > >>>
> > > >> > > > > > > > >>>
> > > >> > > > > > > > >>> On Mon, Aug 13, 2018 at 3:44 PM, xiongqi wu <
> > > >> > > > xiongqiwu@gmail.com
> > > >> > > > > >
> > > >> > > > > > > > wrote:
> > > >> > > > > > > > >>>
> > > >> > > > > > > > >>> > Hi Eno,
> > > >> > > > > > > > >>> >
> > > >> > > > > > > > >>> > The GDPR request we are getting here at linkedin
> > is
> > > >> if we
> > > >> > > > get a
> > > >> > > > > > > > >>> request to
> > > >> > > > > > > > >>> > delete a record through a null key on a log
> > > compacted
> > > >> > > topic,
> > > >> > > > > > > > >>> > we want to delete the record via compaction in a
> > > given
> > > >> > time
> > > >> > > > > period
> > > >> > > > > > > > >>> like 2
> > > >> > > > > > > > >>> > days (whatever is required by the policy).
> > > >> > > > > > > > >>> >
> > > >> > > > > > > > >>> > There might be other issues (such as orphan log
> > > >> segments
> > > >> > > > under
> > > >> > > > > > > > certain
> > > >> > > > > > > > >>> > conditions) that lead to GDPR problem but they are
> > > >> more
> > > >> > > like
> > > >> > > > > > > > >>> something we
> > > >> > > > > > > > >>> > need to fix anyway regardless of GDPR.
> > > >> > > > > > > > >>> >
> > > >> > > > > > > > >>> >
> > > >> > > > > > > > >>> > -- Xiongqi (Wesley) Wu
> > > >> > > > > > > > >>> >
> > > >> > > > > > > > >>> > On Mon, Aug 13, 2018 at 2:56 PM, Eno Thereska <
> > > >> > > > > > > > eno.thereska@gmail.com>
> > > >> > > > > > > > >>> > wrote:
> > > >> > > > > > > > >>> >
> > > >> > > > > > > > >>> > > Hello,
> > > >> > > > > > > > >>> > >
> > > >> > > > > > > > >>> > > Thanks for the KIP. I'd like to see a more
> > precise
> > > >> > > > > definition of
> > > >> > > > > > > > what
> > > >> > > > > > > > >>> > part
> > > >> > > > > > > > >>> > > of GDPR you are targeting as well as some sort
> > of
> > > >> > > > > verification
> > > >> > > > > > > that
> > > >> > > > > > > > >>> this
> > > >> > > > > > > > >>> > > KIP actually addresses the problem. Right now I
> > > find
> > > >> > > this a
> > > >> > > > > bit
> > > >> > > > > > > > >>> vague:
> > > >> > > > > > > > >>> > >
> > > >> > > > > > > > >>> > > "Ability to delete a log message through
> > > compaction
> > > >> in
> > > >> > a
> > > >> > > > > timely
> > > >> > > > > > > > >>> manner
> > > >> > > > > > > > >>> > has
> > > >> > > > > > > > >>> > > become an important requirement in some use
> > cases
> > > >> > (e.g.,
> > > >> > > > > GDPR)"
> > > >> > > > > > > > >>> > >
> > > >> > > > > > > > >>> > >
> > > >> > > > > > > > >>> > > Is there any guarantee that after this KIP the
> > > GDPR
> > > >> > > problem
> > > >> > > > > is
> > > >> > > > > > > > >>> solved or
> > > >> > > > > > > > >>> > do
> > > >> > > > > > > > >>> > > we need to do something else as well, e.g., more
> > > >> KIPs?
> > > >> > > > > > > > >>> > >
> > > >> > > > > > > > >>> > >
> > > >> > > > > > > > >>> > > Thanks
> > > >> > > > > > > > >>> > >
> > > >> > > > > > > > >>> > > Eno
> > > >> > > > > > > > >>> > >
> > > >> > > > > > > > >>> > >
> > > >> > > > > > > > >>> > >
> > > >> > > > > > > > >>> > > On Thu, Aug 9, 2018 at 4:18 PM, xiongqi wu <
> > > >> > > > > xiongqiwu@gmail.com>
> > > >> > > > > > > > >>> wrote:
> > > >> > > > > > > > >>> > >
> > > >> > > > > > > > >>> > > > Hi Kafka,
> > > >> > > > > > > > >>> > > >
> > > >> > > > > > > > >>> > > > This KIP tries to address GDPR concern to
> > > fulfill
> > > >> > > > deletion
> > > >> > > > > > > > request
> > > >> > > > > > > > >>> on
> > > >> > > > > > > > >>> > > time
> > > >> > > > > > > > >>> > > > through time-based log compaction on a
> > > compaction
> > > >> > > enabled
> > > >> > > > > > > topic:
> > > >> > > > > > > > >>> > > >
> > > >> > > > > > > > >>> > > >
> > > >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > >> > > > > > > > >>> > > > 354%3A+Time-based+log+compaction+policy
> > > >> > > > > > > > >>> > > >
> > > >> > > > > > > > >>> > > > Any feedback will be appreciated.
> > > >> > > > > > > > >>> > > >
> > > >> > > > > > > > >>> > > >
> > > >> > > > > > > > >>> > > > Xiongqi (Wesley) Wu
> > > >> > > > > > > > >>> > > >
> > > >> > > > > > > > >>> > >
> > > >> > > > > > > > >>> >
> > > >> > > > > > > > >>>
> > > >> > > > > > > > >>
> > > >> > > > > > > > >>
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> --
> > > >> > > > > > > > >> Xiongqi (Wesley) Wu
> > > >> > > > > > > > >>
> > > >> > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > > > --
> > > >> > > > > > > > > Xiongqi (Wesley) Wu
> > > >> > > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > > --
> > > >> > > > > > > > Xiongqi (Wesley) Wu
> > > >> > > > > > > >
> > > >> > > > > > >
> > > >> > > > > > >
> > > >> > > > > > > --
> > > >> > > > > > >
> > > >> > > > > > > Brett Rann
> > > >> > > > > > >
> > > >> > > > > > > Senior DevOps Engineer
> > > >> > > > > > >
> > > >> > > > > > >
> > > >> > > > > > > Zendesk International Ltd
> > > >> > > > > > >
> > > >> > > > > > > 395 Collins Street, Melbourne VIC 3000 Australia
> > > >> > > > > > >
> > > >> > > > > > > Mobile: +61 (0) 418 826 017
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > --
> > > >> > > > > > Xiongqi (Wesley) Wu
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > >> > > > --
> > > >> > > > Xiongqi (Wesley) Wu
> > > >> > > >
> > > >> > >
> > > >> >
> > > >> >
> > > >> >
> > > >> > --
> > > >> > Xiongqi (Wesley) Wu
> > > >> >
> > > >>
> > > >>
> > > >> --
> > > >>
> > > >> Brett Rann
> > > >>
> > > >> Senior DevOps Engineer
> > > >>
> > > >>
> > > >> Zendesk International Ltd
> > > >>
> > > >> 395 Collins Street, Melbourne VIC 3000 Australia
> > > >>
> > > >> Mobile: +61 (0) 418 826 017
> > > >>
> > > >
> > >
> > >
> > > --
> > > Xiongqi (Wesley) Wu
> > >
> >
> >
> > --
> >
> > Brett Rann
> >
> > Senior DevOps Engineer
> >
> >
> > Zendesk International Ltd
> >
> > 395 Collins Street, Melbourne VIC 3000 Australia
> >
> > Mobile: +61 (0) 418 826 017
> >

Re: [DISCUSS] KIP-354 Time-based log compaction policy

Posted by xiongqi wu <xi...@gmail.com>.
Brett,

Yes, I will post PR tomorrow.

Xiongqi (Wesley) Wu


On Sun, Sep 2, 2018 at 6:28 PM Brett Rann <br...@zendesk.com.invalid> wrote:

> +1 (non-binding) from me on the interface. I'd like to see someone familiar
> with
> the code comment on the approach, and note there's a couple of different
> approaches: what's documented in the KIP, and what Xiaohe Dong was working
> on
> here:
>
> https://github.com/dongxiaohe/kafka/tree/dongxiaohe/log-cleaner-compaction-max-lifetime-2.0
>
> If you have code working already Xiongqi Wu could you share a PR? I'd be
> happy
> to start testing.
>
> On Tue, Aug 28, 2018 at 5:57 AM xiongqi wu <xi...@gmail.com> wrote:
>
> > Hi All,
> >
> > Do you have any additional comments on this KIP?
> >
> >
> > On Thu, Aug 16, 2018 at 9:17 PM, xiongqi wu <xi...@gmail.com> wrote:
> >
> > > on 2)
> > > The offsetmap is built starting from dirty segment.
> > > The compaction starts from the beginning of the log partition. That's
> > > how it ensures the deletion of tombstoned keys.
> > > I will double check tomorrow.
> > >
> > > Xiongqi (Wesley) Wu
> > >
> > >
> > > On Thu, Aug 16, 2018 at 6:46 PM Brett Rann <br...@zendesk.com.invalid>
> > > wrote:
> > >
> > >> To just clarify a bit on 1. whether there's an external storage/DB
> isn't
> > >> relevant here.
> > >> Compacted topics allow a tombstone record to be sent (a null value
> for a
> > >> key) which
> > >> currently will result in old values for that key being deleted if some
> > >> conditions are met.
> > >> There are existing controls to make sure the old values will stay
> around
> > >> for a minimum
> > >> time at least, but no dedicated control to ensure the tombstone will
> > >> delete
> > >> within a
> > >> maximum time.
> > >>
> > >> One popular reason that maximum time for deletion is desirable right
> now
> > >> is
> > >> GDPR with
> > >> PII. But we're not proposing any GDPR awareness in kafka, just being
> > able
> > >> to guarantee
> > >> a max time where a tombstoned key will be removed from the compacted
> > >> topic.
> > >>
> > >> on 2)
> > >> huh, i thought it kept track of the first dirty segment and didn't
> > >> recompact older "clean" ones.
> > >> But I didn't look at code or test for that.
> > >>
> > >> On Fri, Aug 17, 2018 at 10:57 AM xiongqi wu <xi...@gmail.com>
> > wrote:
> > >>
> > >> > 1. The owner of the data (in this sense, Kafka is not the owner)
> > >> > should keep track of lifecycle of the data in some external
> > storage/DB.
> > >> > The owner determines when to delete the data and send the delete
> > >> request to
> > >> > kafka. Kafka doesn't know about the content of the data, but it provides
> > >> > a means for deletion.
> > >> >
> > >> > 2. Each time compaction runs, it will start from the first segment (no
> > >> > matter whether it is compacted or not). The time estimation here is only
> > >> > used to determine whether we should run compaction on this log partition,
> > >> > so we only need to estimate uncompacted segments.
> > >> >
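The check being described — use the estimated earliest timestamp of uncompacted segments only to decide whether a partition needs compacting — could be sketched as follows (illustrative names, not the actual LogCleaner code):

```python
def must_compact(earliest_uncompacted_ts_ms, now_ms, max_compaction_lag_ms):
    """True when the oldest uncompacted data has waited longer than
    max.compaction.lag.ms, so this partition should be picked for
    compaction. A non-positive lag is treated as 'feature disabled' here."""
    if max_compaction_lag_ms <= 0:
        return False
    return now_ms - earliest_uncompacted_ts_ms >= max_compaction_lag_ms

now = 1_000_000_000
assert must_compact(now - 200, now, 100)      # overdue -> must compact
assert not must_compact(now - 50, now, 100)   # still within the lag
assert not must_compact(now - 200, now, 0)    # feature disabled
```

Once a partition passes this test, the cleaner still scans from the start of the log as usual; the timestamp only gates scheduling.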
> > >> > On Thu, Aug 16, 2018 at 5:35 PM, Dong Lin <li...@gmail.com>
> > wrote:
> > >> >
> > >> > > Hey Xiongqi,
> > >> > >
> > >> > > Thanks for the update. I have two questions for the latest KIP.
> > >> > >
> > >> > > 1) The motivation section says that one use case is to delete PII
> > >> > (Personal
> > >> > > Identifiable information) data within 7 days while keeping non-PII
> > >> > > indefinitely in compacted format. I suppose the use-case depends
> on
> > >> the
> > >> > > application to determine when to delete those PII data. Could you
> > >> explain
> > >> > > how can application reliably determine the set of keys that should
> > be
> > >> > > deleted? Is the application required to always read messages from the
> > >> > > topic after every restart and determine the keys to be deleted by
> > >> > > looking at message timestamps, or is the application supposed to persist
> > >> > > the key->timestamp information in a separate persistent storage system?
> > >> > >
> > >> > > 2) It is mentioned in the KIP that "we only need to estimate
> > earliest
> > >> > > message timestamp for un-compacted log segments because the
> deletion
> > >> > > requests that belong to compacted segments have already been
> > >> processed".
> > >> > > Not sure if it is correct. If a segment is compacted before user
> > sends
> > >> > > message to delete a key in this segment, it seems that we still
> need
> > >> to
> > >> > > ensure that the segment will be compacted again within the given
> > time
> > >> > after
> > >> > > the deletion is requested, right?
> > >> > >
> > >> > > Thanks,
> > >> > > Dong
> > >> > >
> > >> > > On Thu, Aug 16, 2018 at 10:27 AM, xiongqi wu <xiongqiwu@gmail.com
> >
> > >> > wrote:
> > >> > >
> > >> > > > Hi Xiaohe,
> > >> > > >
> > >> > > > Quick note:
> > >> > > > 1) Use minimum of segment.ms and max.compaction.lag.ms
> > >> > > >
> > >> > > > 2) I am not sure if I get your second question. first, we have
> > >> jitter
> > >> > > when
> > >> > > > we roll the active segment. Second, on each compaction, we compact up
> > >> > > > to what the offset map allows. Those will not lead to a compaction
> > >> > > > storm over time. In addition, I expect we are setting
> > >> max.compaction.lag.ms
> > >> > on
> > >> > > > the order of days.
> > >> > > >
> > >> > > > 3) I don't have access to the Confluent community Slack for now. I am
> > >> > > > reachable via Google Hangouts.
> > >> > > > To avoid the double effort, here is my plan:
> > >> > > > a) Collect more feedback and feature requirements on the KIP.
> > >> > > > b) Wait until this KIP is approved.
> > >> > > > c) I will address any additional requirements in the implementation.
> > >> > > > (My current implementation only covers what is described in the KIP
> > >> > > > now.)
> > >> > > > d) I can share the code with you and the community to see if you want
> > >> > > > to add anything.
> > >> > > > e) submission through committee
> > >> > > >
> > >> > > >
> > >> > > > On Wed, Aug 15, 2018 at 11:42 PM, XIAOHE DONG <
> > >> dannyrivclo@gmail.com>
> > >> > > > wrote:
> > >> > > >
> > >> > > > > Hi Xiongqi
> > >> > > > >
> > >> > > > > Thanks for thinking about implementing this as well. :)
> > >> > > > >
> > >> > > > > I was thinking about using `segment.ms` to trigger the
> segment
> > >> roll.
> > >> > > > > Also, its value can be the largest time bias for the record
> > >> deletion.
> > >> > > For
> > >> > > > > example, if the `segment.ms` is 1 day and `max.compaction.ms`
> > is
> > >> 30
> > >> > > > days,
> > >> > > > > the compaction may happen around 31 days.
> > >> > > > >
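Xiaohe's worst-case estimate above is just the sum of the two delays; as a quick sanity check (using the hypothetical values from the example):

```python
DAY_MS = 24 * 60 * 60 * 1000
segment_ms = 1 * DAY_MS              # active segment may take this long to roll
max_compaction_lag_ms = 30 * DAY_MS  # the proposed max lag

# A record written just after a roll can sit up to segment.ms in the active
# segment, then up to max.compaction.lag.ms more before compaction removes it.
worst_case_ms = segment_ms + max_compaction_lag_ms
print(worst_case_ms // DAY_MS)  # 31
```

This is also why the KIP's approach of forcing the active segment to roll at min(segment.ms, max.compaction.lag.ms) tightens the bound.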
> > >> > > > > For my curiosity, is there a way we can do some performance
> test
> > >> for
> > >> > > this
> > >> > > > > and any tools you can recommend. As you know, previously, it
> is
> > >> > cleaned
> > >> > > > up
> > >> > > > > by respecting dirty ratio, but now it may happen anytime if
> max
> > >> lag
> > >> > has
> > >> > > > > passed for each message. I wonder what would happen if clients send
> > >> > > > > a huge amount of tombstone records at the same time.
> > >> > > > >
> > >> > > > > I am looking forward to have a quick chat with you to avoid
> > double
> > >> > > effort
> > >> > > > > on this. I am in confluent community slack during the work
> time.
> > >> My
> > >> > > name
> > >> > > > is
> > >> > > > > Xiaohe Dong. :)
> > >> > > > >
> > >> > > > > Rgds
> > >> > > > > Xiaohe Dong
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > > On 2018/08/16 01:22:22, xiongqi wu <xi...@gmail.com>
> wrote:
> > >> > > > > > Brett,
> > >> > > > > >
> > >> > > > > > Thank you for your comments.
> > >> > > > > > I was thinking that since we already have an immediate-compaction
> > >> > > > > > setting (by setting the min dirty ratio to 0), I decided to use "0"
> > >> > > > > > as the disabled state.
> > >> > > > > > I am OK to go with the -1 (disabled), 0 (immediate) options.
> > >> > > > > >
> > >> > > > > > For the implementation, there are a few differences between
> > mine
> > >> > and
> > >> > > > > > "Xiaohe Dong"'s :
> > >> > > > > > 1) I used the estimated creation time of a log segment
> instead
> > >> of
> > >> > > > largest
> > >> > > > > > timestamp of a log to determine the compaction eligibility,
> > >> > because a
> > >> > > > log
> > >> > > > > > segment might stay as an active segment up to "max
> compaction
> > >> lag".
> > >> > > > (see
> > >> > > > > > the KIP for detail).
> > >> > > > > > 2) I measure how much bytes that we must clean to follow the
> > >> "max
> > >> > > > > > compaction lag" rule, and use that to determine the order of
> > >> > > > compaction.
> > >> > > > > > 3) force active segment to roll to follow the "max
> compaction
> > >> lag"
> > >> > > > > >
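Point 2 above — ordering partitions by how many bytes must be cleaned to honor the max compaction lag — might look roughly like this (a sketch with invented data structures, not the actual implementation):

```python
def compaction_order(partitions, now_ms, max_lag_ms):
    """Rank partitions by the bytes that must be cleaned to honor the lag.
    Each partition: {'name': str, 'segments': [(est_create_ts_ms,
    size_bytes, already_compacted)]}. Most overdue bytes come first."""
    def overdue_bytes(p):
        return sum(size for ts, size, compacted in p["segments"]
                   if not compacted and now_ms - ts >= max_lag_ms)
    return sorted(partitions, key=overdue_bytes, reverse=True)

now = 10_000
parts = [
    {"name": "t-0", "segments": [(1_000, 500, False), (9_500, 100, False)]},
    {"name": "t-1", "segments": [(1_000, 900, False)]},
    {"name": "t-2", "segments": [(1_000, 800, True)]},  # already compacted
]
order = [p["name"] for p in compaction_order(parts, now, max_lag_ms=5_000)]
print(order)  # ['t-1', 't-0', 't-2']
```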
> > >> > > > > > I can share my code so we can coordinate.
> > >> > > > > >
> > >> > > > > > I haven't thought about a new API to force a compaction. What is
> > >> > > > > > the use case for this one?
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > On Wed, Aug 15, 2018 at 5:33 PM, Brett Rann
> > >> > > <brann@zendesk.com.invalid
> > >> > > > >
> > >> > > > > > wrote:
> > >> > > > > >
> > >> > > > > > > We've been looking into this too.
> > >> > > > > > >
> > >> > > > > > > Mailing list:
> > >> > > > > > > https://lists.apache.org/thread.html/
> > >> > > ed7f6a6589f94e8c2a705553f364ef
> > >> > > > > > > 599cb6915e4c3ba9b561e610e4@%3Cdev.kafka.apache.org%3E
> > >> > > > > > > jira wish:
> https://issues.apache.org/jira/browse/KAFKA-7137
> > >> > > > > > > confluent slack discussion:
> > >> > > > > > > https://confluentcommunity.slack.com/archives/C49R61XMM/
> > >> > > > > p1530760121000039
> > >> > > > > > >
> > >> > > > > > > A person on my team has started on code so you might want
> to
> > >> > > > > coordinate:
> > >> > > > > > > https://github.com/dongxiaohe/kafka/tree/dongxiaohe/log-
> > >> > > > > > > cleaner-compaction-max-lifetime-2.0
> > >> > > > > > >
> > >> > > > > > > He's been working with Jason Gustafson and James Chen
> around
> > >> the
> > >> > > > > changes.
> > >> > > > > > > You can ping him on confluent slack as Xiaohe Dong.
> > >> > > > > > >
> > >> > > > > > > It's great to know others are thinking on it as well.
> > >> > > > > > >
> > >> > > > > > > You've added the requirement to force a segment roll which
> > we
> > >> > > hadn't
> > >> > > > > gotten
> > >> > > > > > > to yet, which is great. I was content with it not
> including
> > >> the
> > >> > > > active
> > >> > > > > > > segment.
> > >> > > > > > >
> > >> > > > > > > > Adding topic level configuration "max.compaction.lag.ms", and
> > >> > > > > > > corresponding broker configuration
> > >> > > > > > > "log.cleaner.max.compaction.lag.ms", which is set to 0 (disabled)
> > >> > > > > > > by default.
> > >> > > > > > >
> > >> > > > > > > Glancing at some other settings convention seems to me to
> be
> > >> -1
> > >> > for
> > >> > > > > > > disabled (or infinite, which is more meaningful here). 0
> to
> > me
> > >> > > > implies
> > >> > > > > > > instant, a little quicker than 1.
> > >> > > > > > >
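The -1/0 convention Brett suggests, plus the later agreement that max.compaction.lag.ms must not be less than min.compaction.lag.ms, could be captured by a small validation helper (a sketch; the function name and return values are invented):

```python
def classify_max_compaction_lag(max_lag_ms, min_lag_ms=0):
    """-1 disables the bound, 0 means 'compact as soon as possible', and a
    positive value must not be smaller than min.compaction.lag.ms."""
    if max_lag_ms == -1:
        return "disabled"
    if max_lag_ms < 0:
        raise ValueError("max.compaction.lag.ms must be -1, 0, or positive")
    if 0 < max_lag_ms < min_lag_ms:
        raise ValueError("max.compaction.lag.ms must not be less than "
                         "min.compaction.lag.ms")
    return "immediate" if max_lag_ms == 0 else "bounded"

print(classify_max_compaction_lag(-1))                      # disabled
print(classify_max_compaction_lag(0))                       # immediate
print(classify_max_compaction_lag(7 * 86_400_000, 60_000))  # bounded
```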
> > >> > > > > > > We've been trying to think about a way to trigger
> compaction
> > >> as
> > >> > > well
> > >> > > > > > > through an API call, which would need to be flagged
> > somewhere
> > >> (ZK
> > >> > > > > admin/
> > >> > > > > > > space?) but we're struggling to think how that would be
> > >> > coordinated
> > >> > > > > across
> > >> > > > > > > brokers and partitions. Have you given any thought to
> that?
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > On Thu, Aug 16, 2018 at 8:44 AM xiongqi wu <
> > >> xiongqiwu@gmail.com>
> > >> > > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > Eno, Dong,
> > >> > > > > > > >
> > >> > > > > > > > I have updated the KIP. We decide not to address the
> issue
> > >> that
> > >> > > we
> > >> > > > > might
> > >> > > > > > > > have for both compaction and time retention enabled
> topics
> > >> (see
> > >> > > the
> > >> > > > > > > > rejected alternative item 2). This KIP will only ensure
> > log
> > >> can
> > >> > > be
> > >> > > > > > > > compacted after a specified time-interval.
> > >> > > > > > > >
> > >> > > > > > > > As suggested by Dong, we will also enforce that
> > >> > > > > > > > "max.compaction.lag.ms" is not less than "min.compaction.lag.ms".
> > >> > > > > > > >
> > >> > > > > > > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-354
> > >> > > > > Time-based
> > >> > > > > > > log
> > >> > > > > > > > compaction policy
> > >> > > > > > > > <
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-354
> > >> > > > > Time-based
> > >> > > > > > > log compaction policy>
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > On Tue, Aug 14, 2018 at 5:01 PM, xiongqi wu <
> > >> > xiongqiwu@gmail.com
> > >> > > >
> > >> > > > > wrote:
> > >> > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > > Per discussion with Dong, he made a very good point
> that
> > >> if
> > >> > > > > compaction
> > >> > > > > > > > > and time based retention are both enabled on a topic,
> > the
> > >> > > > > compaction
> > >> > > > > > > > might
> > >> > > > > > > > > prevent records from being deleted on time. The reason
> > is
> > >> > when
> > >> > > > > > > compacting
> > >> > > > > > > > > multiple segments into one single segment, the newly
> > >> created
> > >> > > > > segment
> > >> > > > > > > will
> > >> > > > > > > > > have the same lastModified timestamp as the latest original
> > >> > > > > > > > > segment.
> > >> > We
> > >> > > > > lose
> > >> > > > > > > the
> > >> > > > > > > > > timestamp of all original segments except the last
> one.
> > >> As a
> > >> > > > > result,
> > >> > > > > > > > > records might not be deleted as they should be through
> > >> > > > > > > > > time-based retention.
> > >> > > > > > > > >
> > >> > > > > > > > > With the current KIP proposal, if we want to ensure
> > timely
> > >> > > > > deletion, we
> > >> > > > > > > > > have the following configurations:
> > >> > > > > > > > > 1) enable time-based log compaction only: deletion is done
> > >> > > > > > > > > through overriding the same key
> > >> > > > > > > > > 2) enable time-based log retention only: deletion is done
> > >> > > > > > > > > through time-based retention
> > >> > > > > > > > > 3) enable both log compaction and time based
> retention:
> > >> > > Deletion
> > >> > > > > is not
> > >> > > > > > > > > guaranteed.
> > >> > > > > > > > >
> > >> > > > > > > > > Not sure if we have use case 3 and also want deletion to
> > >> > > > > > > > > happen on time. There are several options to address the
> > >> > > > > > > > > deletion issue when both compaction and retention are enabled:
> > >> > > > > > > > > A) During log compaction, look at the record timestamp to
> > >> > > > > > > > > delete expired records. This can be done in the compaction
> > >> > > > > > > > > logic itself or via AdminClient.deleteRecords(). But this
> > >> > > > > > > > > assumes we have record timestamps.
> > >> > > > > > > > > B) Retain the lastModified time of the original segments
> > >> > > > > > > > > during log compaction. This requires extra metadata to record
> > >> > > > > > > > > the information, or not grouping multiple segments into one
> > >> > > > > > > > > during compaction.
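[Editor's note] A minimal sketch of option A, under the assumption that the cleaner has a per-key latest-offset map (the names here are illustrative, not the actual cleaner code): a record is dropped either because a newer record exists for its key, or because its own timestamp has passed retention.

```python
def compact_with_expiry(records, latest_offset, now_ms, retention_ms):
    """Sketch of option A: filter by record timestamp during compaction.

    records: list of (offset, key, value, timestamp_ms)
    latest_offset: key -> offset of the newest record for that key
    """
    kept = []
    for offset, key, value, ts in records:
        if now_ms - ts > retention_ms:
            continue  # expired by its own record timestamp
        if offset < latest_offset.get(key, offset):
            continue  # superseded by a newer record for the same key
        kept.append((offset, key, value, ts))
    return kept

records = [(0, "k1", "v1", 100), (1, "k2", "v2", 900), (2, "k1", "v3", 950)]
latest = {"k1": 2, "k2": 1}
kept = compact_with_expiry(records, latest, now_ms=1000, retention_ms=500)
# Offset 0 is both superseded and expired; offsets 1 and 2 are kept.
```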
> > >> > > > > > > > >
> > >> > > > > > > > > If we have use case 3 in general, I would prefer solution A
> > >> > > > > > > > > and rely on record timestamps.
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > > Two questions:
> > >> > > > > > > > > Do we have use case 3? Is it nice to have or a must-have?
> > >> > > > > > > > > If we have use case 3 and want to go with solution A, should
> > >> > > > > > > > > we introduce a new configuration to enforce deletion by
> > >> > > > > > > > > timestamp?
> > >> > > > > > > > >
> > >> > > > > > > > > On Tue, Aug 14, 2018 at 1:52 PM, xiongqi wu <xiongqiwu@gmail.com> wrote:
> > >> > > > > > > > >
> > >> > > > > > > > >> Dong,
> > >> > > > > > > > >>
> > >> > > > > > > > >> Thanks for the comment.
> > >> > > > > > > > >>
> > >> > > > > > > > >> There are two retention policies: log compaction and
> > >> > > > > > > > >> time-based retention.
> > >> > > > > > > > >>
> > >> > > > > > > > >> Log compaction:
> > >> > > > > > > > >>
> > >> > > > > > > > >> We have use cases that keep infinite retention of a topic
> > >> > > > > > > > >> (only compaction). GDPR cares about deletion of PII
> > >> > > > > > > > >> (personally identifiable information) data.
> > >> > > > > > > > >> Since Kafka doesn't know which records contain PII, it relies
> > >> > > > > > > > >> on the upper layer to delete those records.
> > >> > > > > > > > >> For those infinite-retention use cases, Kafka needs to
> > >> > > > > > > > >> provide a way to enforce compaction on time. This is what we
> > >> > > > > > > > >> try to address in this KIP.
> > >> > > > > > > > >>
> > >> > > > > > > > >> Time-based retention:
> > >> > > > > > > > >>
> > >> > > > > > > > >> There are also use cases in which users of Kafka might want
> > >> > > > > > > > >> to expire all their data.
> > >> > > > > > > > >> In those cases, they can use time-based retention on their
> > >> > > > > > > > >> topics.
> > >> > > > > > > > >>
> > >> > > > > > > > >>
> > >> > > > > > > > >> Regarding your first question: if a user wants to delete a
> > >> > > > > > > > >> key in a log-compacted topic, the user has to send a deletion
> > >> > > > > > > > >> (tombstone) using the same key.
> > >> > > > > > > > >> Kafka only makes sure the deletion will happen within a
> > >> > > > > > > > >> certain time period (like 2 days/7 days).
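[Editor's note] A toy model of how that deletion propagates (shape and names are illustrative, not Kafka internals): a tombstone is a record with a null value for the key; one cleaning pass removes the older values for that key, and once the tombstone's own retention window (delete.retention.ms) has passed, a later pass removes the tombstone itself.

```python
def cleaning_pass(log, tombstone_retention_expired):
    """log: list of (offset, key, value); value=None marks a tombstone."""
    latest = {}
    for offset, key, value in log:
        latest[key] = offset  # remember the newest offset per key
    cleaned = []
    for offset, key, value in log:
        if latest[key] != offset:
            continue  # an older record for this key, removed by compaction
        if value is None and tombstone_retention_expired:
            continue  # tombstone past its retention window, removed too
        cleaned.append((offset, key, value))
    return cleaned

log = [(0, "k1", "v1"), (1, "k2", "v2"), (2, "k1", None)]
first = cleaning_pass(log, tombstone_retention_expired=False)
second = cleaning_pass(first, tombstone_retention_expired=True)
```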
> > >> > > > > > > > >>
> > >> > > > > > > > >> Regarding your second question: in most cases, we might want
> > >> > > > > > > > >> to delete all duplicated keys at the same time.
> > >> > > > > > > > >> Compaction might be more efficient, since we need to scan the
> > >> > > > > > > > >> log and find all duplicates. However, the expected use case
> > >> > > > > > > > >> is to set the time-based compaction interval on the order of
> > >> > > > > > > > >> days, and larger than the "min compaction lag". We don't want
> > >> > > > > > > > >> log compaction to happen frequently, since it is expensive.
> > >> > > > > > > > >> The purpose is to help low-production-rate topics get
> > >> > > > > > > > >> compacted on time. For a topic with a "normal" incoming
> > >> > > > > > > > >> message rate, the "min dirty ratio" might have triggered the
> > >> > > > > > > > >> compaction before this time-based compaction policy takes
> > >> > > > > > > > >> effect.
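[Editor's note] The triggering logic described above can be sketched as follows (parameter names are illustrative, not the final KIP-354 config keys): a log becomes eligible for compaction either via the existing dirty-ratio rule, or, for low-traffic topics, once its oldest uncompacted record is older than the maximum compaction lag.

```python
def should_compact(dirty_bytes, total_bytes, first_dirty_ts_ms,
                   now_ms, min_dirty_ratio, max_compaction_lag_ms):
    """Sketch of the combined compaction-eligibility check."""
    dirty_ratio = dirty_bytes / total_bytes if total_bytes else 0.0
    if dirty_ratio >= min_dirty_ratio:
        return True  # the existing dirty-ratio trigger
    # time-based trigger, for topics too quiet to reach the dirty ratio
    return now_ms - first_dirty_ts_ms >= max_compaction_lag_ms

# A busy topic trips the dirty-ratio rule; a quiet topic eventually
# trips the time-based rule; a quiet, recent log trips neither.
busy = should_compact(60, 100, first_dirty_ts_ms=0, now_ms=1_000,
                      min_dirty_ratio=0.5, max_compaction_lag_ms=10**9)
quiet = should_compact(1, 100, first_dirty_ts_ms=0, now_ms=200_000_000,
                       min_dirty_ratio=0.5, max_compaction_lag_ms=100_000_000)
neither = should_compact(1, 100, first_dirty_ts_ms=0, now_ms=1_000,
                         min_dirty_ratio=0.5, max_compaction_lag_ms=10**9)
```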
> > >> > > > > > > > >>
> > >> > > > > > > > >>
> > >> > > > > > > > >> Eno,
> > >> > > > > > > > >>
> > >> > > > > > > > >> For your question: as I mentioned, we have long-retention
> > >> > > > > > > > >> use cases for log-compacted topics, but we want to provide
> > >> > > > > > > > >> the ability to delete certain PII records on time.
> > >> > > > > > > > >> Kafka itself doesn't know whether a record contains sensitive
> > >> > > > > > > > >> information and relies on the user for deletion.
> > >> > > > > > > > >>
> > >> > > > > > > > >>
> > >> > > > > > > > >> On Mon, Aug 13, 2018 at 6:58 PM, Dong Lin <lindong28@gmail.com> wrote:
> > >> > > > > > > > >>
> > >> > > > > > > > >>> Hey Xiongqi,
> > >> > > > > > > > >>>
> > >> > > > > > > > >>> Thanks for the KIP. I have two questions regarding the use
> > >> > > > > > > > >>> case of meeting the GDPR requirement.
> > >> > > > > > > > >>>
> > >> > > > > > > > >>> 1) If I recall correctly, one of the GDPR requirements is
> > >> > > > > > > > >>> that we cannot keep messages longer than e.g. 30 days in
> > >> > > > > > > > >>> storage (e.g. Kafka). Say there exists a partition p0 which
> > >> > > > > > > > >>> contains message1 with key1 and message2 with key2. And then
> > >> > > > > > > > >>> the user keeps producing messages with key=key2 to this
> > >> > > > > > > > >>> partition. Since message1 with key1 is never overridden,
> > >> > > > > > > > >>> sooner or later we will want to delete message1 and keep the
> > >> > > > > > > > >>> latest message with key=key2. But currently it looks like the
> > >> > > > > > > > >>> log compaction logic in Kafka will always put these messages
> > >> > > > > > > > >>> in the same segment. Will this be an issue?
> > >> > > > > > > > >>>
> > >> > > > > > > > >>> 2) The current KIP intends to provide the capability to
> > >> > > > > > > > >>> delete a given message in a log-compacted topic. Does such a
> > >> > > > > > > > >>> use case also require Kafka to keep the messages produced
> > >> > > > > > > > >>> before the given message? If yes, then we can probably just
> > >> > > > > > > > >>> use AdminClient.deleteRecords() or time-based log retention
> > >> > > > > > > > >>> to meet the use-case requirement. If no, do you know what
> > >> > > > > > > > >>> GDPR's requirement is on time-to-deletion after a user
> > >> > > > > > > > >>> explicitly requests the deletion (e.g. 1 hour, 1 day, 7 days)?
> > >> > > > > > > > >>>
> > >> > > > > > > > >>> Thanks,
> > >> > > > > > > > >>> Dong
> > >> > > > > > > > >>>
> > >> > > > > > > > >>>
> > >> > > > > > > > >>> On Mon, Aug 13, 2018 at 3:44 PM, xiongqi wu <xiongqiwu@gmail.com> wrote:
> > >> > > > > > > > >>>
> > >> > > > > > > > >>> > Hi Eno,
> > >> > > > > > > > >>> >
> > >> > > > > > > > >>> > The GDPR request we are getting here at LinkedIn is: if
> > >> > > > > > > > >>> > we get a request to delete a record through a tombstone (a
> > >> > > > > > > > >>> > null value for the key) on a log-compacted topic,
> > >> > > > > > > > >>> > we want to delete the record via compaction within a given
> > >> > > > > > > > >>> > time period, like 2 days (whatever is required by the
> > >> > > > > > > > >>> > policy).
> > >> > > > > > > > >>> >
> > >> > > > > > > > >>> > There might be other issues (such as orphan log segments
> > >> > > > > > > > >>> > under certain conditions) that lead to GDPR problems, but
> > >> > > > > > > > >>> > they are more like something we need to fix anyway,
> > >> > > > > > > > >>> > regardless of GDPR.
> > >> > > > > > > > >>> >
> > >> > > > > > > > >>> >
> > >> > > > > > > > >>> > -- Xiongqi (Wesley) Wu
> > >> > > > > > > > >>> >
> > >> > > > > > > > >>> > On Mon, Aug 13, 2018 at 2:56 PM, Eno Thereska <eno.thereska@gmail.com> wrote:
> > >> > > > > > > > >>> >
> > >> > > > > > > > >>> > > Hello,
> > >> > > > > > > > >>> > >
> > >> > > > > > > > >>> > > Thanks for the KIP. I'd like to see a more precise
> > >> > > > > > > > >>> > > definition of what part of GDPR you are targeting, as
> > >> > > > > > > > >>> > > well as some sort of verification that this KIP actually
> > >> > > > > > > > >>> > > addresses the problem. Right now I find this a bit vague:
> > >> > > > > > > > >>> > >
> > >> > > > > > > > >>> > > "Ability to delete a log message through
> > compaction
> > >> in
> > >> > a
> > >> > > > > timely
> > >> > > > > > > > >>> manner
> > >> > > > > > > > >>> > has
> > >> > > > > > > > >>> > > become an important requirement in some use
> cases
> > >> > (e.g.,
> > >> > > > > GDPR)"
> > >> > > > > > > > >>> > >
> > >> > > > > > > > >>> > >
> > >> > > > > > > > >>> > > Is there any guarantee that after this KIP the GDPR
> > >> > > > > > > > >>> > > problem is solved, or do we need to do something else as
> > >> > > > > > > > >>> > > well, e.g., more KIPs?
> > >> > > > > > > > >>> > >
> > >> > > > > > > > >>> > >
> > >> > > > > > > > >>> > > Thanks
> > >> > > > > > > > >>> > >
> > >> > > > > > > > >>> > > Eno
> > >> > > > > > > > >>> > >
> > >> > > > > > > > >>> > >
> > >> > > > > > > > >>> > >
> > >> > > > > > > > >>> > > On Thu, Aug 9, 2018 at 4:18 PM, xiongqi wu <xiongqiwu@gmail.com> wrote:
> > >> > > > > > > > >>> > >
> > >> > > > > > > > >>> > > > Hi Kafka,
> > >> > > > > > > > >>> > > >
> > >> > > > > > > > >>> > > > This KIP tries to address the GDPR concern of
> > >> > > > > > > > >>> > > > fulfilling deletion requests on time through
> > >> > > > > > > > >>> > > > time-based log compaction on a compaction-enabled
> > >> > > > > > > > >>> > > > topic:
> > >> > > > > > > > >>> > > >
> > >> > > > > > > > >>> > > >
> > >> > > > > > > > >>> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-354%3A+Time-based+log+compaction+policy
> > >> > > > > > > > >>> > > >
> > >> > > > > > > > >>> > > > Any feedback will be appreciated.
> > >> > > > > > > > >>> > > >
> > >> > > > > > > > >>> > > >
> > >> > > > > > > > >>> > > > Xiongqi (Wesley) Wu
> > >> > > > > > > > >>> > > >
> > >> > > > > > > > >>> > >
> > >> > > > > > > > >>> >
> > >> > > > > > > > >>>
> > >> > > > > > > > >>
> > >> > > > > > > > >>
> > >> > > > > > > > >>
> > >> > > > > > > > >> --
> > >> > > > > > > > >> Xiongqi (Wesley) Wu
> > >> > > > > > > > >>
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > > --
> > >> > > > > > > > > Xiongqi (Wesley) Wu
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > --
> > >> > > > > > > > Xiongqi (Wesley) Wu
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > --
> > >> > > > > > >
> > >> > > > > > > Brett Rann
> > >> > > > > > >
> > >> > > > > > > Senior DevOps Engineer
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > Zendesk International Ltd
> > >> > > > > > >
> > >> > > > > > > 395 Collins Street, Melbourne VIC 3000 Australia
> > >> > > > > > >
> > >> > > > > > > Mobile: +61 (0) 418 826 017
> > >> > > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > --
> > >> > > > > > Xiongqi (Wesley) Wu
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > --
> > >> > > > Xiongqi (Wesley) Wu
> > >> > > >
> > >> > >
> > >> >
> > >> >
> > >> >
> > >> > --
> > >> > Xiongqi (Wesley) Wu
> > >> >
> > >>
> > >>
> > >>
> > >
> >
> >
> > --
> > Xiongqi (Wesley) Wu
> >
>
>
>