Posted to jira@kafka.apache.org by "Steve Rodrigues (Jira)" <ji...@apache.org> on 2020/04/06 22:37:00 UTC

[jira] [Created] (KAFKA-9826) Log cleaning goes in infinite loop when first dirty offset is past start of active segment

Steve Rodrigues created KAFKA-9826:
--------------------------------------

             Summary: Log cleaning goes in infinite loop when first dirty offset is past start of active segment
                 Key: KAFKA-9826
                 URL: https://issues.apache.org/jira/browse/KAFKA-9826
             Project: Kafka
          Issue Type: Bug
          Components: log cleaner
    Affects Versions: 2.4.1
            Reporter: Steve Rodrigues
            Assignee: Steve Rodrigues


Seen on a system where a given partition had a single segment, and for whatever reason (deleteRecords?) the logStartOffset was greater than the base offset of that segment. There was a continuous series of

```
[2020-03-03 16:56:31,374] WARN Resetting first dirty offset of FOO-3 to log start offset 55649 since the checkpointed offset 0 is invalid. (kafka.log.LogCleanerManager$)
```

messages (the partition name has been changed; it wasn't really FOO). This was expected to have been resolved by KAFKA-6266, but clearly wasn't.
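For context, that warning comes from the first-dirty-offset validation in LogCleanerManager.cleanableOffsets. Roughly (a paraphrase from memory of the 2.4 code path, not verbatim source):

```
// If the checkpointed clean offset has fallen behind the log start
// (e.g. after deleteRecords), it is treated as invalid and reset.
val checkpointed = lastCleanOffset.getOrElse(logStartOffset)
val firstDirtyOffset =
  if (checkpointed < logStartOffset) {
    // checkpointed = 0, logStartOffset = 55649 -> the WARN above
    warn(s"Resetting first dirty offset of $topicPartition to log start offset " +
      s"$logStartOffset since the checkpointed offset $checkpointed is invalid.")
    logStartOffset
  } else checkpointed
```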

Further investigation revealed that a few partitions were being cleaned continuously, generating messages in the `log-cleaner.log` of the form:

```
[2020-03-31 13:34:50,924] INFO Cleaner 1: Beginning cleaning of log FOO-3 (kafka.log.LogCleaner)
[2020-03-31 13:34:50,924] INFO Cleaner 1: Building offset map for FOO-3... (kafka.log.LogCleaner)
[2020-03-31 13:34:50,927] INFO Cleaner 1: Building offset map for log FOO-3 for 0 segments in offset range [55287, 54237). (kafka.log.LogCleaner)
[2020-03-31 13:34:50,927] INFO Cleaner 1: Offset map for log FOO-3 complete. (kafka.log.LogCleaner)
[2020-03-31 13:34:50,927] INFO Cleaner 1: Cleaning log FOO-3 (cleaning prior to Wed Dec 31 19:00:00 EST 1969, discarding tombstones prior to Tue Dec 10 13:39:08 EST 2019)... (kafka.log.LogCleaner)
[2020-03-31 13:34:50,927] INFO [kafka-log-cleaner-thread-1]: Log cleaner thread 1 cleaned log FOO-3 (dirty section = [55287, 55287])
0.0 MB of log processed in 0.0 seconds (0.0 MB/sec).
Indexed 0.0 MB in 0.0 seconds (0.0 Mb/sec, 100.0% of total time)
Buffer utilization: 0.0%
Cleaned 0.0 MB in 0.0 seconds (NaN Mb/sec, 0.0% of total time)
Start size: 0.0 MB (0 messages)
End size: 0.0 MB (0 messages) NaN% size reduction (NaN% fewer messages) (kafka.log.LogCleaner)
```

What seems to have happened here (the offsets below were gathered from a different partition) is:

There exist a number of partitions here which get relatively low traffic, including our friend FOO-5. For whatever reason, LogStartOffset of this partition has moved beyond the baseOffset of the active segment. (Notes in other issues indicate that this is a reasonable scenario.) So there’s one segment, starting at 166266, and a log start of 166301.
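To visualize the state (offsets as reported for FOO-5):

```
base offset of the only (active) segment:  166266
logStartOffset:                            166301   <- past the segment's base offset
```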

grabFilthiestCompactedLog runs and reads the checkpoint file. We see that this topic-partition needs to be cleaned, and call cleanableOffsets on it, which returns an OffsetsToClean with firstDirtyOffset == logStartOffset == 166301 and firstUncleanableOffset = max(logStart, activeSegment.baseOffset) = 166301, and forceCheckpoint = true.
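Plugging in the numbers (a restatement of the values above, not code from the source):

```
firstDirtyOffset       = logStartOffset                         = 166301
firstUncleanableOffset = max(logStart, activeSegment.baseOffset)
                       = max(166301, 166266)                    = 166301
```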

The checkpoint file is updated in grabFilthiestCompactedLog (this is the fix for KAFKA-6266). We then create a LogToClean object based on the firstDirtyOffset and firstUncleanableOffset of 166301 (past the active segment’s base offset).

The LogToClean object has cleanBytes = logSegments(-1, firstDirtyOffset).map(_.size).sum → the size of this segment. Its firstUncleanableOffset and cleanableBytes are determined by calculateCleanableBytes, which returns:

```
val firstUncleanableSegment = log.nonActiveLogSegmentsFrom(uncleanableOffset).headOption.getOrElse(log.activeSegment)
val firstUncleanableOffset = firstUncleanableSegment.baseOffset
val cleanableBytes = log.logSegments(firstDirtyOffset, math.max(firstDirtyOffset, firstUncleanableOffset)).map(_.size.toLong).sum

(firstUncleanableOffset, cleanableBytes)
```
firstUncleanableSegment is activeSegment. firstUncleanableOffset is its base offset, 166266. cleanableBytes looks up logSegments(166301, max(166301, 166266)) = logSegments(166301, 166301) → which _returns the active segment_.

So we end up with cleanableBytes > 0.

We then filter for logs with totalBytes (clean + cleanable) > 0. This log has totalBytes > 0, and it has cleanableBytes, so great! It's the filthiest.
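As a sketch of that selection step (paraphrased from the description above; the minCleanableRatio name follows the 2.4 code from memory and is not quoted verbatim):

```
// Logs with no bytes at all are dropped; everything else competes on ratio.
val dirtyLogs     = candidates.filter(ltc => ltc.totalBytes > 0)
val cleanableLogs = dirtyLogs.filter(ltc => ltc.cleanableRatio > ltc.log.config.minCleanableRatio)
// LogToClean orders by cleanableRatio, so our one-segment log with
// phantom "cleanableBytes" can win and be picked as filthiest.
val filthiest = cleanableLogs.max
```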

The cleaner picks it and calls cleanLog on it, which then does cleaner.clean, which returns the nextDirtyOffset and cleaner stats. cleaner.clean calls doClean, which builds an offsetMap. The offsetMap only looks at non-active segments when building, but there aren't any. So the endOffset of the offsetMap is lastOffset (default -1) + 1 → 0. We record the stats (including logging to log-cleaner.log). After this we call cleanerManager.doneCleaning, which writes the checkpoint file with the latest value… of 0.
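In other words (a paraphrase of the relevant step in Cleaner.doClean, from memory of the 2.4 code path, not verbatim source):

```
// No non-active segments fall in [firstDirtyOffset, firstUncleanableOffset),
// so nothing is ever put into the offset map and latestOffset stays at its
// initial value of -1.
buildOffsetMap(log, firstDirtyOffset, firstUncleanableOffset, offsetMap, stats)
val endOffset = offsetMap.latestOffset + 1   // -1 + 1 == 0
// doneCleaning then checkpoints endOffset == 0, which the next pass through
// grabFilthiestCompactedLog rejects as invalid, and the cycle repeats.
```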

And then the process starts all over.

It appears that there's at least one bug here: `log.logSegments(from, to)` will return an empty list if from == to and both are segment-aligned, but _not_ if they fall in the middle of a segment. Possibly a second: LogToClean uses start = firstDirtyOffset and end = max(firstDirtyOffset, firstUncleanableOffset), so it can move the firstUncleanableOffset forward even when the firstDirtyOffset is already past firstUncleanableOffset.
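A minimal, self-contained sketch of that boundary behavior, modeling the segment map the way Log does (a ConcurrentSkipListMap keyed by base offset); this is a simplification for illustration, not the actual Log.logSegments code:

```
import java.util.concurrent.ConcurrentSkipListMap
import scala.jdk.CollectionConverters._

object LogSegmentsBoundaryDemo extends App {
  // One "segment" with base offset 166266, as in the report.
  val segments = new ConcurrentSkipListMap[java.lang.Long, String]()
  segments.put(166266L, "segment@166266")

  // Simplified logSegments(from, to): snap `from` down to the base offset of
  // the segment containing it, then take the half-open sub-map up to `to`.
  def logSegments(from: Long, to: Long): List[String] =
    Option(segments.floorKey(from))
      .map(floor => segments.subMap(floor, to))
      .getOrElse(segments.headMap(to))
      .values.asScala.toList

  // from == to and segment-aligned: floorKey(166266) = 166266, and
  // subMap(166266, 166266) is empty, as you'd expect of an empty range.
  println(logSegments(166266L, 166266L)) // List()

  // from == to but mid-segment: floorKey(166301) snaps back to 166266,
  // so subMap(166266, 166301) is non-empty despite the empty range.
  println(logSegments(166301L, 166301L)) // List(segment@166266)
}
```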

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)