Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/11/30 08:30:02 UTC

[GitHub] [druid] samarthjain commented on issue #11658: Infinite automatic Kafka offset resetting

samarthjain commented on issue #11658:
URL: https://github.com/apache/druid/issues/11658#issuecomment-982398021


   @FrankChen021 - I believe the code is doing the right thing. It may be doing an unnecessary seek, but it shouldn't cause an infinite retry loop. 
   At line 140, when `recordSupplier.getEarliestSequenceNumber(streamPartition)` is called, it does the following:
   ```
    // remember the consumer's current position
    Long currPos = getPosition(partition);
    // move to the earliest available offset and read it
    seekToEarliest(Collections.singleton(partition));
    Long nextPos = getPosition(partition);
    // seek back to the position the consumer started at
    seek(partition, currPos);
   ```
   So it gets the current position for the partition, seeks to the earliest offset in that partition, reads that position, and then seeks back to restore the position it started at. The earliest offset is returned and stored in the `leastAvailableOffset` variable.
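   For reference, here is the same probe-and-restore pattern expressed directly against the Kafka consumer API (a minimal sketch; `probeEarliest` is an illustrative name I made up, not actual Druid code):
   ```
    import java.util.Collections;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    // Illustrative sketch, not Druid code: probe the earliest retained offset
    // without disturbing the consumer's current position.
    long probeEarliest(KafkaConsumer<?, ?> consumer, TopicPartition partition) {
        long currPos = consumer.position(partition);        // remember where we are
        consumer.seekToBeginning(Collections.singleton(partition));
        long earliest = consumer.position(partition);       // earliest available offset
        consumer.seek(partition, currPos);                  // restore the starting position
        return earliest;
    }
   ```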
   
   Then, at line 148, `recordSupplier.seek(streamPartition, nextOffset);` assumes that the call to `getEarliestSequenceNumber` may have left the current position different from the position it stored at the start in `nextOffset`, so it does an extra seek to restore the state as it was before. The reason I said it may be unnecessary is that I saw log lines like this:
   ```
   getEarliestSequenceNumber() logs:
   Seeking to EARLIEST offset of partition logs_metrics-14
   Resetting offset for partition insight_logs_metrics-14 to offset 2944387224.
   Seeking to offset 22400192044 for partition logs_metrics-14
   
   and then
   // reset the seek 
   recordSupplier.seek(streamPartition, nextOffset); 
   Seeking to offset 22400192044 for partition logs_metrics-14
   ```
   As you can see, two seeks to the same offset, 22400192044, were performed.
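   One way to skip the redundant seek would be to restore the position only when the probe actually moved it (a sketch only, assuming the supplier exposes a `getPosition()` accessor like the one used internally above):
   ```
    // Sketch, not a patch: only seek back if the probe moved the position.
    // Assumes recordSupplier exposes getPosition(), as the internal snippet suggests.
    Long currentPos = recordSupplier.getPosition(streamPartition);
    if (!java.util.Objects.equals(currentPos, nextOffset)) {
        recordSupplier.seek(streamPartition, nextOffset);
    }
   ```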
   
   
   Having said that, I recently ran into an issue where, even though I have `resetOffsetAutomatically` configured, Druid didn't call reset. This resulted in a stuck Kafka consumer that just kept spewing `OffsetOutOfRangeException` messages. Looking closely at the code, it turned out that the offset reported in `outOfRangePartition` (22400192044) was *higher* than the `leastAvailableOffset` (2944387224) reported by Kafka. As a result, the condition `leastAvailableOffset > nextOffset` was never met, and the consumer just kept sleeping for 30 seconds before retrying.
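   Plugging the numbers from my logs into that condition shows why the reset never fires (a minimal sketch using the values above):
   ```
    // Values taken from the logs above.
    long nextOffset = 22400192044L;          // consumer's offset, out of range on the high end
    long leastAvailableOffset = 2944387224L; // earliest offset Kafka still retains

    // The reset only covers the case where the consumer fell behind retention:
    if (leastAvailableOffset > nextOffset) {
        // reset offsets automatically
    }
    // 2944387224 > 22400192044 is false, so the reset branch is never taken
    // and the consumer sleeps 30 seconds and retries forever.
   ```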
   
   I wonder if you have also run into something similar?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


