You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by GitBox <gi...@apache.org> on 2021/05/07 16:05:08 UTC

[GitHub] [kafka] soarez opened a new pull request #10650: KAFKA-12763 NoSuchElementException during checkpointLogStartOffsets

soarez opened a new pull request #10650:
URL: https://github.com/apache/kafka/pull/10650


   The following exception shows that `log.logSegments` may be empty during `checkpointLogStartOffsets`.
   `logSegments.headOption` should be used instead of `logSegments.head` to prevent this exception.
   
   ```
   {
     "class": "java.util.NoSuchElementException",
     "msg": null,
     "stack": [
       "java.util.concurrent.ConcurrentSkipListMap$ValueIterator.next(ConcurrentSkipListMap.java:2123)",
       "scala.collection.convert.JavaCollectionWrappers$JIteratorWrapper.next(JavaCollectionWrappers.scala:38)",
       "scala.collection.IterableOps.head(Iterable.scala:218)",
       "scala.collection.IterableOps.head$(Iterable.scala:218)",
       "scala.collection.AbstractIterable.head(Iterable.scala:920)",
       "kafka.log.LogManager$$anonfun$4.applyOrElse(LogManager.scala:640)",
       "kafka.log.LogManager$$anonfun$4.applyOrElse(LogManager.scala:639)",
       "scala.collection.Iterator$$anon$7.hasNext(Iterator.scala:516)",
       "scala.collection.mutable.Growable.addAll(Growable.scala:61)",
       "scala.collection.mutable.Growable.addAll$(Growable.scala:59)",
       "scala.collection.mutable.HashMap.addAll(HashMap.scala:111)",
       "scala.collection.mutable.HashMap$.from(HashMap.scala:549)",
       "scala.collection.mutable.HashMap$.from(HashMap.scala:542)",
       "scala.collection.MapFactory$Delegate.from(Factory.scala:425)",
       "scala.collection.MapOps.collect(Map.scala:283)",
       "scala.collection.MapOps.collect$(Map.scala:282)",
       "scala.collection.AbstractMap.collect(Map.scala:375)",
       "kafka.log.LogManager.$anonfun$checkpointLogStartOffsetsInDir$2(LogManager.scala:639)",
       "kafka.log.LogManager.$anonfun$checkpointLogStartOffsetsInDir$1(LogManager.scala:636)",
       "kafka.log.LogManager.checkpointLogStartOffsetsInDir(LogManager.scala:635)",
       "kafka.log.LogManager.$anonfun$checkpointLogStartOffsets$1(LogManager.scala:600)",
       "kafka.log.LogManager.$anonfun$checkpointLogStartOffsets$1$adapted(LogManager.scala:600)",
       "scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553)",
       "scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551)",
       "scala.collection.AbstractIterable.foreach(Iterable.scala:920)",
       "kafka.log.LogManager.checkpointLogStartOffsets(LogManager.scala:600)",
       "kafka.log.LogManager.$anonfun$startup$6(LogManager.scala:426)",
       "kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)",
       "java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)",
       "java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)",
       "java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)",
       "java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)",
       "java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)",
       "java.lang.Thread.run(Thread.java:834)"
     ]
   }
   ```
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] kowshik commented on pull request #10650: KAFKA-12763 NoSuchElementException during checkpointLogStartOffsets

Posted by GitBox <gi...@apache.org>.
kowshik commented on pull request #10650:
URL: https://github.com/apache/kafka/pull/10650#issuecomment-844632594


   @soarez Thanks for the PR. The JIRA says the issue was encountered in Kafka version 2.6.2. Are you able to also reproduce this issue on AK trunk (latest code) and/or in versions released after 2.6.2?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] ijuma commented on pull request #10650: KAFKA-12763 NoSuchElementException during checkpointLogStartOffsets

Posted by GitBox <gi...@apache.org>.
ijuma commented on pull request #10650:
URL: https://github.com/apache/kafka/pull/10650#issuecomment-843373430


   @soarez the concern is that it may hide a problem. So, it's important to understand the sequence of events.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] soarez commented on pull request #10650: KAFKA-12763 NoSuchElementException during checkpointLogStartOffsets

Posted by GitBox <gi...@apache.org>.
soarez commented on pull request #10650:
URL: https://github.com/apache/kafka/pull/10650#issuecomment-843367411


   Hi @dhruvilshah3
   
   This happened when a broker that was offline for a long time came back up. I agree this shouldn't happen in the first place. I have logs, but it isn't clear how this happened.
   
   Still, given that it can happen, it seems to me that it is still better to avoid this exception. And I see no downsides in adding the extra check.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] soarez commented on pull request #10650: KAFKA-12763 NoSuchElementException during checkpointLogStartOffsets

Posted by GitBox <gi...@apache.org>.
soarez commented on pull request #10650:
URL: https://github.com/apache/kafka/pull/10650#issuecomment-846401654


   @kowshik I see, that makes sense. My theory about how this happens does not hold. The race must happen after the constructor exits.
   
   I'm going to patch our brokers to log extra information when if this exception is thrown and will try to reproduce in the same cluster.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] kowshik edited a comment on pull request #10650: KAFKA-12763 NoSuchElementException during checkpointLogStartOffsets

Posted by GitBox <gi...@apache.org>.
kowshik edited a comment on pull request #10650:
URL: https://github.com/apache/kafka/pull/10650#issuecomment-846318535


   @soarez The `Log.loadSegments()` call happens in `Log` [constructor](https://github.com/apache/kafka/blob/2.6/core/src/main/scala/kafka/log/Log.scala#L287). I'm curious how would the `segments.clear()` call from there race with the `kafka-log-start-offset-checkpoint` periodic work, because the periodic work can't see the `Log` instance until it is [instantiated](https://github.com/apache/kafka/blob/2.6/core/src/main/scala/kafka/log/LogManager.scala#L769-L784) and made discoverable by adding to `LogManager.currentLogs` or `LogManager.futureLogs`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] soarez commented on pull request #10650: KAFKA-12763 NoSuchElementException during checkpointLogStartOffsets

Posted by GitBox <gi...@apache.org>.
soarez commented on pull request #10650:
URL: https://github.com/apache/kafka/pull/10650#issuecomment-834576169


   @ijuma @junrao please take a look at this one liner.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] soarez commented on pull request #10650: KAFKA-12763 NoSuchElementException during checkpointLogStartOffsets

Posted by GitBox <gi...@apache.org>.
soarez commented on pull request #10650:
URL: https://github.com/apache/kafka/pull/10650#issuecomment-846136326


   @ijuma I can see it makes to not hide the underlying issue. But at the moment, the exception provides no helpful information either. What do you think about adding the topic name and partition number to the exception message when this failure happens?
   
   @kowshik No. This is a tricky one to reproduce, it looks like a race condition. However, there are some clues about what happened.
   
   The exception happened shortly after the broker that had been offline for over a week was started again in a busy cluster. And just before this `NoSuchElementException` was thrown there were many warnings of this sort:
   
   ```
   [Log partition=foo-1, dir=/d3/data] Could not find offset index file corresponding to log file /d3/data/foo-1/00000000000003424067.log, recovering segment and rebuilding index files...
   ```
   
   Looking at the 2.6.2 source, this message is only produced in [`Log.loadSegmentFiles`](https://github.com/apache/kafka/blob/25f4b2337d8199cd1420516c2a817d2e0d878cb4/core/src/main/scala/kafka/log/Log.scala#L624), which is only called from [`Log.loadSegments`](https://github.com/apache/kafka/blob/25f4b2337d8199cd1420516c2a817d2e0d878cb4/core/src/main/scala/kafka/log/Log.scala#L713). And right before that call there is a [`segments.clear()`](https://github.com/apache/kafka/blob/25f4b2337d8199cd1420516c2a817d2e0d878cb4/core/src/main/scala/kafka/log/Log.scala#L712) which will cause [`logSegments`](https://github.com/apache/kafka/blob/25f4b2337d8199cd1420516c2a817d2e0d878cb4/core/src/main/scala/kafka/log/Log.scala#L2144) to be empty. I think that's how this happened.
   
   Looking at `trunk` the [segments are also cleared](https://github.com/apache/kafka/blob/0b464419e2e4f28444fca653ae5aa8dd7feae9ba/core/src/main/scala/kafka/log/LogLoader.scala#L105-L106) in the same way.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] kowshik commented on pull request #10650: KAFKA-12763 NoSuchElementException during checkpointLogStartOffsets

Posted by GitBox <gi...@apache.org>.
kowshik commented on pull request #10650:
URL: https://github.com/apache/kafka/pull/10650#issuecomment-846142330


   @soarez Nice explanation. Regarding the segments getting cleared, note that the call to `recoverLog()` will eventually ensure that there is at least the active segment, see [these lines](https://github.com/apache/kafka/blob/2.6/core/src/main/scala/kafka/log/Log.scala#L813-L822). So it is surprising that you saw the exception.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] soarez commented on pull request #10650: KAFKA-12763 NoSuchElementException during checkpointLogStartOffsets

Posted by GitBox <gi...@apache.org>.
soarez commented on pull request #10650:
URL: https://github.com/apache/kafka/pull/10650#issuecomment-843352974


   @ijuma can you have a look at this one?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] kowshik edited a comment on pull request #10650: KAFKA-12763 NoSuchElementException during checkpointLogStartOffsets

Posted by GitBox <gi...@apache.org>.
kowshik edited a comment on pull request #10650:
URL: https://github.com/apache/kafka/pull/10650#issuecomment-846318535


   @soarez The `Log.loadSegments()` call happens in `Log` [constructor](https://github.com/apache/kafka/blob/2.6/core/src/main/scala/kafka/log/Log.scala#L287). I'm curious how would the `segments.clear()` call from there race with the `kafka-log-start-offset-checkpoint` periodic work, because the periodic work can't see the `Log` instance until it is [instantiated](https://github.com/apache/kafka/blob/2.6/core/src/main/scala/kafka/log/LogManager.scala#L769-L784) and made discoverable by adding to `LogManager.currentLogs` and `LogManager.futureLogs`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] soarez commented on pull request #10650: KAFKA-12763 NoSuchElementException during checkpointLogStartOffsets

Posted by GitBox <gi...@apache.org>.
soarez commented on pull request #10650:
URL: https://github.com/apache/kafka/pull/10650#issuecomment-834575330


   https://issues.apache.org/jira/browse/KAFKA-12763


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] kowshik commented on pull request #10650: KAFKA-12763 NoSuchElementException during checkpointLogStartOffsets

Posted by GitBox <gi...@apache.org>.
kowshik commented on pull request #10650:
URL: https://github.com/apache/kafka/pull/10650#issuecomment-846318535


   @soarez The `Log.loadSegments()` call happens in `Log` [constructor](https://github.com/apache/kafka/blob/2.6/core/src/main/scala/kafka/log/Log.scala#L287). I'm curious how would the `segments.clear()` call from there race with the `kafka-log-start-offset-checkpoint` periodic work (the periodic work can't see the `Log` instance until it is [instantiated](https://github.com/apache/kafka/blob/2.6/core/src/main/scala/kafka/log/LogManager.scala#L769-L784)) and made discoverable by adding to `LogManager.currentLogs` and `LogManager.futureLogs`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] soarez commented on pull request #10650: KAFKA-12763 NoSuchElementException during checkpointLogStartOffsets

Posted by GitBox <gi...@apache.org>.
soarez commented on pull request #10650:
URL: https://github.com/apache/kafka/pull/10650#issuecomment-846301061


   @kowshik I'm not sure it is surprising. The `NoSuchElementException` is thrown from the scheduled task `kafka-log-start-offset-checkpoint` - see original stack trace - which runs in a separate thread, and as far as I can see there is no synchronization logic to ensure that it won't access segments between the `segments.clear()` I pointed out, and the subsequent lines you refer to. There a time window there were the segments are actually empty.
   
   There is an initial delay on the scheduled task of 30 seconds. I'm guessing that initial delay is there to let all the logs load up and recover correctly. But if there are a lot of logs to recover - like in the case of an inactive broker coming back after a long time in a busy cluster - that may not be enough.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org