Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2022/01/31 22:28:28 UTC

[GitHub] [druid] harinirajendran edited a comment on issue #11414: Kafka ingestion lag spikes up whenever tasks are rolling

harinirajendran edited a comment on issue #11414:
URL: https://github.com/apache/druid/issues/11414#issuecomment-1026275109


   > I see. Thank you for confirming it. Your analysis seems correct to me. Now I'm curious what notices the supervisor was processing 🙂
   
   @jihoonson @jasonk000 : I have some more updates on this issue. The supervisor is actually spending a lot of time processing `runNotices`, which causes the `checkpointNotice` to wait in the notices queue for a long time; tasks then get stuck, which results in ingestion lag.
   
   In our case, we have seen `runNotices` take ~7s, as shown in the graph below.
   ![Screen Shot 2022-01-31 at 4 13 50 PM](https://user-images.githubusercontent.com/9054348/151882120-44685a65-4e6d-4ee1-bfbc-667f751eed8b.png)
   As a result, the notices queue gets backed up when the number of tasks is large, since each `runNotice` takes a long time to process.
   ![Screen Shot 2022-01-31 at 4 17 03 PM](https://user-images.githubusercontent.com/9054348/151882291-6cea98e3-6494-464d-bb93-1af0559adbe6.png)
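   The starvation effect described above can be illustrated with a toy sketch (not Druid code; the class name, the 40-task count, and the fixed 7s cost are illustrative assumptions): a single consumer drains the notices queue, so a `checkpointNotice` queued behind many slow `runNotices` waits for all of them.

   ```java
   import java.util.ArrayDeque;
   import java.util.Queue;

   // Toy model (not Druid code): a single-threaded notices queue where each
   // slow runNotice delays every notice queued behind it.
   public class NoticeQueueSketch {
       static final long RUN_NOTICE_MILLIS = 7_000; // assumed ~7s per runNotice, per the graph above

       // Returns how long a checkpointNotice waits behind `taskCount` queued runNotices.
       static long checkpointWaitMillis(int taskCount) {
           Queue<String> notices = new ArrayDeque<>();
           for (int i = 0; i < taskCount; i++) {
               notices.add("runNotice");
           }
           notices.add("checkpointNotice");

           long waited = 0;
           while (!notices.isEmpty()) {
               String notice = notices.poll();
               if (notice.equals("checkpointNotice")) {
                   break; // finally reached the checkpoint notice
               }
               waited += RUN_NOTICE_MILLIS; // checkpoint waits while each runNotice runs
           }
           return waited;
       }

       public static void main(String[] args) {
           // With 40 tasks each triggering a ~7s runNotice, the checkpointNotice
           // sits in the queue for minutes before it is even handled.
           System.out.println("checkpointNotice waited ms: " + checkpointWaitMillis(40));
       }
   }
   ```

   With 40 queued `runNotices` at ~7s each, the checkpoint waits ~280s, which is the kind of backlog visible in the queue-size graph.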
   
   On further analysis, we realized that the bulk of the ~7s of `run_notice` processing is spent in the [getAsyncStatus()](https://github.com/confluentinc/druid/blob/0.21.0-confluent/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java#L1442) call in the `discoverTasks` function. When a task boots up, it takes roughly ~5s to start the JVM and the HTTP server. As a result, [these](https://github.com/confluentinc/druid/blob/0.21.0-confluent/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java#L1593) `Futures` take about ~6s (including retries) to get the status of tasks that are still bootstrapping, which is what makes `runNotice` take so long.
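   The retry timing above can be sketched with a simulated clock (again not Druid code; the 5s boot time and the 2s retry interval are assumptions for illustration): each status poll before the task's HTTP server is up fails, so the first successful poll lands one retry interval past the boot point.

   ```java
   // Toy model (not Druid code): a status poll against a booting task fails
   // until the task's HTTP server is up, so retries push the result past boot.
   public class StatusRetrySketch {
       static final long BOOT_MILLIS = 5_000;  // assumed ~5s for JVM + HTTP server startup
       static final long RETRY_MILLIS = 2_000; // hypothetical fixed retry interval

       // Simulated clock: elapsed time at which the first successful poll lands.
       static long millisUntilStatusAvailable() {
           long elapsed = 0;
           while (elapsed < BOOT_MILLIS) {
               elapsed += RETRY_MILLIS; // poll failed; wait one interval and retry
           }
           return elapsed; // first poll at or after boot succeeds
       }

       public static void main(String[] args) {
           // Polls at 2s and 4s fail; the 6s poll succeeds, matching the
           // ~6s per-task stall described above.
           System.out.println("status available after ms: " + millisUntilStatusAvailable());
       }
   }
   ```

   Under these assumptions the future resolves at the 6s mark even though the task only needed 5s to boot, because the retry cadence quantizes the wait.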
   
   So, it's the tasks' bootstrap time, and hence their inability to respond to the supervisor's `/status` call, that causes `run_notice` to take ~6s; this backs up the notices queue, starving the `checkpoint_notice` and producing the ingestion lag. Does that make sense?
   
   Have you seen something similar on your end? How long do Kafka real-time tasks take to bootstrap in your deployments? (Also, we currently use MiddleManagers rather than Indexers.)
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


