Posted to dev@kafka.apache.org by "Brett Rann (JIRA)" <ji...@apache.org> on 2018/07/06 07:57:00 UTC

[jira] [Created] (KAFKA-7137) ability to trigger compaction for tombstoning and GDPR

Brett Rann created KAFKA-7137:
---------------------------------

             Summary: ability to trigger compaction for tombstoning and GDPR
                 Key: KAFKA-7137
                 URL: https://issues.apache.org/jira/browse/KAFKA-7137
             Project: Kafka
          Issue Type: Wish
            Reporter: Brett Rann


I've just spent some time wrapping my head around the inner workings of compaction and tombstoning, with a view to guaranteeing that previous values of tombstoned keys are deleted from Kafka within a desired time.

There's a couple of good posts that touch on this:
https://www.confluent.io/blog/handling-gdpr-log-forget/
http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/

Basically, log.cleaner.min.cleanable.ratio (or the topic-level min.cleanable.dirty.ratio) is hijacked to force aggressive compaction (by setting it to 0, or 0.000000001, depending on what you read), which together with segment.ms can provide timing guarantees that, once a tombstone arrives, all other values for that key are deleted within a desired time.
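Concretely, the hijack described above is just two topic-level config overrides; a sketch using the stock kafka-configs.sh tool (zookeeper address and topic name are placeholders):

```shell
# Force near-immediate compaction eligibility on one topic:
# any dirty byte makes the log cleanable, and segments roll every minute.
kafka-configs.sh --zookeeper localhost:2181 \
  --entity-type topics --entity-name my-topic \
  --alter --add-config min.cleanable.dirty.ratio=0.000001,segment.ms=60000
```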

But that sacrifices the utility of min.cleanable.dirty.ratio (and, to a lesser extent, control over segment sizes). On any duplicate key plus a new segment roll it will run compaction, when it might otherwise be preferable to allow a more generous dirty ratio in the case of plain old duplicates.

It would be useful to have control over triggering a compaction without losing the utility of the dirty.ratio setting.

The real need here is to be able to bound the delay before the log cleaner runs on a topic whose tombstoned keys are past the minimum retention time provided by min.compaction.lag.ms.
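To make the timing argument concrete, here's a back-of-envelope sketch of the worst-case delay under the hijack approach (my own illustration, not Kafka code; the 15s default for the cleaner's check interval corresponds to log.cleaner.backoff.ms, and the sketch ignores cleaner backlog and I/O throttling):

```python
def worst_case_deletion_delay_ms(min_compaction_lag_ms: int,
                                 segment_ms: int,
                                 cleaner_backoff_ms: int = 15_000) -> int:
    """Rough upper bound on how long a pre-tombstone value can survive,
    assuming min.cleanable.dirty.ratio=0 so any dirty byte makes the
    log cleanable. Illustrative arithmetic only."""
    # the value must age past min.compaction.lag.ms, the active segment
    # must roll, and then the cleaner must notice the log on its next check
    return min_compaction_lag_ms + segment_ms + cleaner_backoff_ms

# e.g. no compaction lag, segments rolling every minute:
print(worst_case_deletion_delay_ms(0, 60_000))  # -> 75000
```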

Something like a log.cleaner.max.delay.ms, and an API to trigger compaction, with some nuances to be fleshed out.
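For illustration only, the wished-for knob might look like this (log.cleaner.max.delay.ms is the hypothetical name suggested above; it does not exist in any Kafka release):

```properties
# hypothetical config, per the wish above -- not an existing Kafka setting
log.cleaner.max.delay.ms=86400000   # compact any cleanable topic at least daily
```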

Does this make sense, and sound like it's worth a KIP? I'd be happy to write something up.

In the meantime, this can be worked around with some duct tape:

* make sure any values you want deleted by a tombstone have passed min retention configs
* set global log.cleaner.io.max.bytes.per.second to what you want for the compaction task
* set min.cleanable.dirty.ratio=0 for the topic
* set a small segment.ms
* wait for a new segment to roll (segment.ms elapsing plus a message coming in), then wait for compaction to kick in. GDPR met!
* undo the hacks
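The duct-tape cycle above could be scripted roughly as follows (broker/zookeeper address, topic name, and the sleep are placeholders to tune for your cluster; --delete-config reverts the overrides to the topic's previous defaults):

```shell
#!/usr/bin/env bash
set -euo pipefail
ZK=localhost:2181
TOPIC=my-topic

# 1. force aggressive compaction: clean on any dirty byte, roll segments every minute
kafka-configs.sh --zookeeper "$ZK" --entity-type topics --entity-name "$TOPIC" \
  --alter --add-config min.cleanable.dirty.ratio=0.000001,segment.ms=60000

# 2. wait for a segment roll plus a cleaner pass (placeholder duration)
sleep 300

# 3. undo the hacks: remove the topic-level overrides
kafka-configs.sh --zookeeper "$ZK" --entity-type topics --entity-name "$TOPIC" \
  --alter --delete-config min.cleanable.dirty.ratio,segment.ms
```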




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)