Posted to jira@kafka.apache.org by "Jun Rao (Jira)" <ji...@apache.org> on 2020/04/15 05:29:00 UTC

[jira] [Resolved] (KAFKA-9826) Log cleaning repeatedly picks same segment with no effect when first dirty offset is past start of active segment

     [ https://issues.apache.org/jira/browse/KAFKA-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jun Rao resolved KAFKA-9826.
----------------------------
    Fix Version/s: 2.6.0
       Resolution: Fixed

merged the PR to trunk

> Log cleaning repeatedly picks same segment with no effect when first dirty offset is past start of active segment
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-9826
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9826
>             Project: Kafka
>          Issue Type: Bug
>          Components: log cleaner
>    Affects Versions: 2.4.1
>            Reporter: Steve Rodrigues
>            Assignee: Steve Rodrigues
>            Priority: Major
>             Fix For: 2.6.0
>
>
> Seen on a system where a given partition had a single segment and, for whatever reason (deleteRecords?), the logStartOffset was greater than that segment's base offset, the broker emitted a continuous stream of
> ```
> [2020-03-03 16:56:31,374] WARN Resetting first dirty offset of  FOO-3 to log start offset 55649 since the checkpointed offset 0 is invalid. (kafka.log.LogCleanerManager$)
> ```
> messages (partition name changed, it wasn't really FOO). This was expected to be resolved by KAFKA-6266 but clearly wasn't. 
> Further investigation revealed that a few segments were being cleaned continuously, generating messages in `log-cleaner.log` of the form:
> ```
> [2020-03-31 13:34:50,924] INFO Cleaner 1: Beginning cleaning of log FOO-3 (kafka.log.LogCleaner)
> [2020-03-31 13:34:50,924] INFO Cleaner 1: Building offset map for FOO-3... (kafka.log.LogCleaner)
> [2020-03-31 13:34:50,927] INFO Cleaner 1: Building offset map for log FOO-3 for 0 segments in offset range [55287, 54237). (kafka.log.LogCleaner)
> [2020-03-31 13:34:50,927] INFO Cleaner 1: Offset map for log FOO-3 complete. (kafka.log.LogCleaner)
> [2020-03-31 13:34:50,927] INFO Cleaner 1: Cleaning log FOO-3 (cleaning prior to Wed Dec 31 19:00:00 EST 1969, discarding tombstones prior to Tue Dec 10 13:39:08 EST 2019)... (kafka.log.LogCleaner)
> [2020-03-31 13:34:50,927] INFO [kafka-log-cleaner-thread-1]: Log cleaner thread 1 cleaned log FOO-3 (dirty section = [55287, 55287])
> 0.0 MB of log processed in 0.0 seconds (0.0 MB/sec).
> Indexed 0.0 MB in 0.0 seconds (0.0 Mb/sec, 100.0% of total time)
> Buffer utilization: 0.0%
> Cleaned 0.0 MB in 0.0 seconds (NaN Mb/sec, 0.0% of total time)
> Start size: 0.0 MB (0 messages)
> End size: 0.0 MB (0 messages) NaN% size reduction (NaN% fewer messages) (kafka.log.LogCleaner)
> ```
> What seems to have happened here (data determined for a different partition) is:
> There exist a number of partitions here which get relatively low traffic, including our friend FOO-5. For whatever reason, LogStartOffset of this partition has moved beyond the baseOffset of the active segment. (Notes in other issues indicate that this is a reasonable scenario.) So there’s one segment, starting at 166266, and a log start of 166301.
> grabFilthiestCompactedLog runs and reads the checkpoint file. We see that this topic-partition needs to be cleaned, and call cleanableOffsets on it, which returns an OffsetsToClean with firstDirtyOffset == logStartOffset == 166301, firstUncleanableOffset = max(logStart, activeSegment.baseOffset) = 166301, and forceCheckpoint = true.
> The checkpoint file is updated in grabFilthiestCompactedLog (this is the fix for KAFKA-6266). We then create a LogToClean object based on the firstDirtyOffset and firstUncleanableOffset of 166301 (past the active segment’s base offset).
> The LogToClean object has cleanBytes = logSegments(-1, firstDirtyOffset).map(_.size).sum → the size of this segment. It has firstUncleanableOffset and cleanableBytes determined by calculateCleanableBytes. calculateCleanableBytes returns:
> ```
> val firstUncleanableSegment = log.nonActiveLogSegmentsFrom(uncleanableOffset).headOption.getOrElse(log.activeSegment)
> val firstUncleanableOffset = firstUncleanableSegment.baseOffset
> val cleanableBytes = log.logSegments(firstDirtyOffset, math.max(firstDirtyOffset, firstUncleanableOffset)).map(_.size.toLong).sum
> (firstUncleanableOffset, cleanableBytes)
> ```
> firstUncleanableSegment is activeSegment. firstUncleanableOffset is its base offset, 166266. cleanableBytes sums logSegments(166301, max(166301, 166266)) → which _is the active segment_.
> So there are “cleanableBytes” > 0.
> We then filter, keeping logs with totalBytes (clean + cleanable) > 0. This log has totalBytes > 0, and it has cleanableBytes, so great! It’s filthiest.
> The cleaner picks it and calls cleanLog on it, which then does cleaner.clean, which returns the nextDirtyOffset and cleaner stats. cleaner.clean calls doClean, which builds an offsetMap. The offsetMap only looks at non-active segments when building, but there aren’t any, so the endOffset of the offsetMap is lastOffset (default -1) + 1 → 0. We record the stats (including logging to log-cleaner.log). After this we call cleanerManager.doneCleaning, which writes the checkpoint file with the latest value… of 0.
> And then the process starts all over.
> It appears that there's at least one bug here: `log.logSegments(from, to)` returns an empty list if from == to and both are segment-aligned, but _not_ if they fall in the middle of a segment. Possibly a second: LogToClean uses start = firstDirtyOffset, end = max(firstDirtyOffset, firstUncleanableOffset), which can move firstUncleanableOffset forward even when firstDirtyOffset is already past it.
>  
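
The interval arithmetic walked through above can be reproduced with a small self-contained sketch (a hypothetical `Segment` type and a simplified `logSegments`, not Kafka's actual classes). It shows both the aligned-vs-mid-segment asymmetry of `logSegments(from, to)` and the resulting nonzero `cleanableBytes` for an empty dirty range:

```scala
object CleanableBytesSketch {
  // Hypothetical stand-in for a log segment: base offset plus size in bytes.
  case class Segment(baseOffset: Long, size: Long)

  // Simplified Log.logSegments(from, to): the segment containing `from`,
  // plus any later segments whose baseOffset is below `to`.
  def logSegments(segs: Seq[Segment], from: Long, to: Long): Seq[Segment] = {
    val floorBase = segs.map(_.baseOffset).filter(_ <= from).sorted.lastOption.getOrElse(from)
    segs.filter(s => s.baseOffset >= floorBase && s.baseOffset < to)
  }

  // Simplified version of the calculateCleanableBytes logic quoted above.
  def calculateCleanableBytes(segs: Seq[Segment], active: Segment,
                              firstDirtyOffset: Long, uncleanableOffset: Long): (Long, Long) = {
    val firstUncleanableSegment = segs
      .filter(s => s != active && s.baseOffset >= uncleanableOffset)
      .sortBy(_.baseOffset).headOption.getOrElse(active)
    val firstUncleanableOffset = firstUncleanableSegment.baseOffset
    val cleanableBytes =
      logSegments(segs, firstDirtyOffset, math.max(firstDirtyOffset, firstUncleanableOffset))
        .map(_.size).sum
    (firstUncleanableOffset, cleanableBytes)
  }

  def main(args: Array[String]): Unit = {
    val active = Segment(166266L, 1024L) // the log's single (active) segment
    val segs = Seq(active)

    // Empty range at a segment boundary: no segments. Empty range mid-segment: the segment.
    assert(logSegments(segs, 166266L, 166266L).isEmpty)
    assert(logSegments(segs, 166301L, 166301L).nonEmpty)

    // firstDirtyOffset past the active segment's base: cleanableBytes is still > 0.
    val (uncleanable, cleanable) = calculateCleanableBytes(segs, active, 166301L, 166301L)
    println(s"firstUncleanableOffset=$uncleanable cleanableBytes=$cleanable")
    assert(uncleanable == 166266L && cleanable == 1024L)
  }
}
```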
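The resulting cycle — invalid checkpoint reset to logStartOffset, empty offset map, checkpoint written back as 0 — can be sketched as a toy simulation (heavily simplified and hypothetical; none of these are Kafka's real signatures):

```scala
object CleanerLoopSketch {
  val logStartOffset = 166301L
  var checkpoint = 0L // checkpointed dirty offset, already invalid (< logStartOffset)

  // One cleaner pass over a log whose only segment is the active one.
  def cleanOnePass(): Long = {
    // An invalid checkpoint is reset to logStartOffset (the WARN message above).
    val firstDirtyOffset = if (checkpoint < logStartOffset) logStartOffset else checkpoint
    // The offset map only covers non-active segments; there are none, so its
    // endOffset defaults to lastOffset (-1) + 1 = 0.
    val mapEndOffset = -1L + 1L
    checkpoint = mapEndOffset // doneCleaning persists 0, re-poisoning the checkpoint
    firstDirtyOffset
  }

  def main(args: Array[String]): Unit = {
    val picks = (1 to 3).map(_ => cleanOnePass())
    println(picks.mkString(", ")) // the same dirty offset on every pass: no progress
    assert(picks.forall(_ == logStartOffset) && checkpoint == 0L)
  }
}
```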
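One plausible guard (an illustrative sketch only, not necessarily the change that was merged for 2.6.0) is to report zero cleanable bytes whenever firstDirtyOffset is at or past firstUncleanableOffset, so the totalBytes/cleanableBytes filter no longer selects such a log:

```scala
object CleanableGuardSketch {
  // Hypothetical guarded variant: if the dirty range is empty or inverted,
  // report zero cleanable bytes instead of summing segments behind it.
  def cleanableBytes(segmentSizes: Seq[Long], firstDirtyOffset: Long,
                     firstUncleanableOffset: Long): Long =
    if (firstDirtyOffset >= firstUncleanableOffset) 0L
    else segmentSizes.sum // simplified: sizes of segments in [firstDirtyOffset, firstUncleanableOffset)

  def main(args: Array[String]): Unit = {
    // Offsets from the walkthrough: dirty offset 166301 is past uncleanable 166266,
    // so the log reports nothing cleanable and is never picked as filthiest.
    println(cleanableBytes(Seq(1024L), 166301L, 166266L))
    assert(cleanableBytes(Seq(1024L), 166301L, 166266L) == 0L)
  }
}
```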



--
This message was sent by Atlassian Jira
(v8.3.4#803005)