You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2022/04/15 22:11:15 UTC

[GitHub] [druid] harinirajendran opened a new issue, #11414: Kafka ingestion lag spikes up whenever tasks are rolling

harinirajendran opened a new issue, #11414:
URL: https://github.com/apache/druid/issues/11414

   ### Affected Version
   
   0.21
   
   ### Description
   
   We have a druid cluster in which we are ingesting about 2.5M events/second. We have 90 ingestion tasks for 1 of our data sources with task duration set to 1 hr. Whenever the tasks roll every hour, Kafka ingestion lag spikes up anywhere from 3M to even 15M `druid.ingest.kafka.lag`. On further analysis, we noted that while tasks are rolling, some of the active ingestion tasks are stuck in `pause` state for a long time (sometimes up to 1.5-2 minutes) during which those tasks aren't ingesting any data resulting in ingestion lag spike.
   
   Logs from MM tasks with a huge gap between `pause` and `resume`
   ```
   {"@timestamp":"2021-06-21T17:34:51.628Z", "log.level":"DEBUG", "message":"Received pause command, pausing ingestion until resumed.", "service.name":"druid/middleManager","event.dataset":"druid/middleManager.log","process.thread.name":"task-runner-0-priority-0","log.logger":"org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner"}
   {"@timestamp":"2021-06-21T17:36:27.089Z", "log.level":"DEBUG", "message":"Received pause command, pausing ingestion until resumed.", "service.name":"druid/middleManager","event.dataset":"druid/middleManager.log","process.thread.name":"task-runner-0-priority-0","log.logger":"org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner"}
   {"@timestamp":"2021-06-21T17:36:27.097Z", "log.level":"DEBUG", "message":"Received resume command, resuming ingestion.", "service.name":"druid/middleManager","event.dataset":"druid/middleManager.log","process.thread.name":"task-runner-0-priority-0","log.logger":"org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner"}
   ```  
   In the above loglines, we can see that that task was in `pause` state from `17:34:51` to `17:36:27`.
   
   On further analysis, we figured out that the MM taskRunner goes to a pause state when it is requesting a `checkpoint` notice [here](https://github.com/apache/druid/blob/8264203cee688607091232897749e959e7706010/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/SeekableStreamIndexTaskRunner.java#L729).
   From the time the taskRunner submits the checkpoint notice, it actually takes around 1.5 minutes for the coordinator to actually process this checkpoint notice.  We can see it in the coordinator logs below for a specific task.
   ```
   Jun 21, 2021 @ 17:34:51.624 Performing action for task[<task_id>]: CheckPointDataSourceMetadataAction{supervisorId='<supervisor_id>',          taskGroupId='14', checkpointMetadata=KafkaDataSourceMetadata{SeekableStreamStartSequenceNumbers=SeekableStreamStartSequenceNumbers{stream='<kafka-topic>', partitionSequenceNumberMap={104=238760642689, 14=337995870870}, exclusivePartitions=[]}}} coordinator-0
   Jun 21, 2021 @ 17:34:51.624 Checkpointing [KafkaDataSourceMetadata{SeekableStreamStartSequenceNumbers=SeekableStreamStartSequenceNumbers{stream='<kafka-topic>', partitionSequenceNumberMap={104=238760642689, 14=337995870870}, exclusivePartitions=[]}}] for taskGroup [14] coordinator-0
   Jun 21, 2021 @ 17:36:27.086 Pause task[<task_id>] coordinator-0
   Jun 21, 2021 @ 17:36:27.087 HTTP POST: http://<MMHost:MMport>/druid/worker/v1/chat/<task_id>/pause  coordinator-0
   Jun 21, 2021 @ 17:36:27.089 SetEndOffsets task[<task_id>] endOffsets[{104=238763631003, 14=337998805846}] finalize[false] coordinator-0
   Jun 21, 2021 @ 17:36:27.089 Task [<task_id>] paused successfully  coordinator-0
   Jun 21, 2021 @ 17:36:27.091 HTTP POST: http://<MMHost:MMport>/druid/worker/v1/chat/<task_id>/offsets/end?finish=false coordinator-0
   Jun 21, 2021 @ 17:36:27.097 Handled checkpoint notice, new checkpoint is [{104=238763631003, 14=337998805846}] for taskGroup [14] coordinator-0
   ```
   
   Note that this long pause of ingestion task happens **only** when tasks are rolling. Not during other times.
   
   Our guess here is that, while tasks are rolling, the [notices queue](https://github.com/confluentinc/druid/blob/0.21.0-confluent/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java#L752) has a lot of notices in them and each notice takes a long time to be processed thus causing significant delay in the `checkpoint` notice as well to be processed once its added to the queue.
   
   Currently, we do not have logs in place to figure out how many notices are there in this queue at any point and how long does each notice takes to get executed.
   
   Spent some more time analyzing further and we saw that some of the following functions [here](https://github.com/apache/druid/blob/master/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java#L1447) are taking multiple seconds when tasks are rolling as opposed to few milliseconds when tasks aren't.
   ```
         discoverTasks();
   
         updateTaskStatus();
   
         checkTaskDuration();
   
         checkPendingCompletionTasks();
   
         checkCurrentTaskState();
   ```
   Some of the above functions require checking task status and that seems to be taking a long time.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jasonk000 commented on issue #11414: Kafka ingestion lag spikes up whenever tasks are rolling

Posted by "jasonk000 (via GitHub)" <gi...@apache.org>.
jasonk000 commented on issue #11414:
URL: https://github.com/apache/druid/issues/11414#issuecomment-1546534220

   @harinirajendran , can you confirm if this is still an issue with Druid 25?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] harinirajendran commented on issue #11414: Kafka ingestion lag spikes up whenever tasks are rolling

Posted by GitBox <gi...@apache.org>.
harinirajendran commented on issue #11414:
URL: https://github.com/apache/druid/issues/11414#issuecomment-1100437789

   We solved this problem by switching from Middle Managers to Indexers. The hourly spikes don't happen anymore.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on issue #11414: Kafka ingestion lag spikes up whenever tasks are rolling

Posted by GitBox <gi...@apache.org>.
gianm commented on issue #11414:
URL: https://github.com/apache/druid/issues/11414#issuecomment-1105898543

   > The log line I searched was something like `Started ServerConnector@6e475994{HTTP/1.1, (http/1.1)}{0.0.0.0:8102}` which happens roughly at the 4th second after the tasks start. And runNotice at the overlord gets stuck [here](https://github.com/confluentinc/druid/blob/0.21.0-confluent/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java#L1442) for that duration until task can respond to the `/status` call. This is my observation. Does it make sense?
   
   It does make sense. Out of curiosity -- what did you have chatThreads set to? I wonder if your number of tasks is much higher than your number of chat threads, and that caused the delays to compound.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


Re: [I] Kafka ingestion lag spikes up whenever tasks are rolling (druid)

Posted by "layoaster (via GitHub)" <gi...@apache.org>.
layoaster commented on issue #11414:
URL: https://github.com/apache/druid/issues/11414#issuecomment-1852413230

   FYI this is still an issue on Druid 27.0.0. I had set a task duration of 10 min (to reduce querying load on MMs peons) and you can see avg lag spikes matches the task roll period.
   
   ![image](https://github.com/apache/druid/assets/4516478/b8e7f31b-118c-4d02-9741-940b2df0ffcc)
   
   This is currently not a big deal because we have low traffic but when testing but higher volumes the spikes become a real burden because the lag spikes cannot be reduced within the 10 min period so the lag keeps increasing.
   
   And like the OP, the Kafka tasks reports the highest avg pending time.
   
   ![image](https://github.com/apache/druid/assets/4516478/363f3368-00e8-4ad5-8443-824431d08690)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] harinirajendran closed issue #11414: Kafka ingestion lag spikes up whenever tasks are rolling

Posted by GitBox <gi...@apache.org>.
harinirajendran closed issue #11414: Kafka ingestion lag spikes up whenever tasks are rolling
URL: https://github.com/apache/druid/issues/11414


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on issue #11414: Kafka ingestion lag spikes up whenever tasks are rolling

Posted by GitBox <gi...@apache.org>.
gianm commented on issue #11414:
URL: https://github.com/apache/druid/issues/11414#issuecomment-1311505442

   Some recent work in this area of SeekableStreamSupervisor:
   
   - Eliminate some unnecessary metadata calls: https://github.com/apache/druid/pull/13328
   - Don't ever queue more than one RunNotice at a time: https://github.com/apache/druid/pull/13334
   - Contact tasks asynchronously, eliminate `chatThreads`, reduce load on `workerThreads`: https://github.com/apache/druid/pull/13354
   
   We are targeting Druid 25 for all of these. If anyone has a chance to try these patches out in advance of the release, please let us know!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on issue #11414: Kafka ingestion lag spikes up whenever tasks are rolling

Posted by GitBox <gi...@apache.org>.
gianm commented on issue #11414:
URL: https://github.com/apache/druid/issues/11414#issuecomment-1105898687

   > The log line I searched was something like `Started ServerConnector@6e475994{HTTP/1.1, (http/1.1)}{0.0.0.0:8102}` which happens roughly at the 4th second after the tasks start. And runNotice at the overlord gets stuck [here](https://github.com/confluentinc/druid/blob/0.21.0-confluent/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java#L1442) for that duration until task can respond to the `/status` call. This is my observation. Does it make sense?
   
   @harinirajendran It does make sense. Out of curiosity -- what did you have chatThreads set to? I wonder if your number of tasks is much higher than your number of chat threads, and that caused the delays to compound.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] harinirajendran commented on issue #11414: Kafka ingestion lag spikes up whenever tasks are rolling

Posted by GitBox <gi...@apache.org>.
harinirajendran commented on issue #11414:
URL: https://github.com/apache/druid/issues/11414#issuecomment-1108858672

   > > The log line I searched was something like `Started ServerConnector@6e475994{HTTP/1.1, (http/1.1)}{0.0.0.0:8102}` which happens roughly at the 4th second after the tasks start. And runNotice at the overlord gets stuck [here](https://github.com/confluentinc/druid/blob/0.21.0-confluent/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java#L1442) for that duration until task can respond to the `/status` call. This is my observation. Does it make sense?
   > 
   > @harinirajendran It does make sense. Out of curiosity -- what did you have chatThreads set to? I wonder if your number of tasks is much higher than your number of chat threads, and that caused the delays to compound.
   
   @gianm : We just use the default value for chatThreads. The document says `default == min(10, taskCount * replicas)` . So, it should be 10 in our case.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org