Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/06/14 11:01:32 UTC

[GitHub] [druid] liran-funaro commented on issue #7900: Develop a new Indexer process for running ingestion tasks

liran-funaro commented on issue #7900:
URL: https://github.com/apache/druid/issues/7900#issuecomment-643750934

I realize that I'm a little late to the party since `CliIndexer` is already merged, but I want to raise a possible issue with this design.

Once many concurrent incremental indexes are processed on the same JVM heap, the number of long-lived objects on that heap will be larger than in any individual Peon.
Unfortunately, the JVM does not handle workloads with a huge number of long-lived objects well.
This evidently causes long pause times in each GC cycle, which can add up to as much as 50% of the process runtime.
However, the value of using the `CliIndexer`, IMO, is great.
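
One rough way to check how much GC contributes on a given JVM is to compare the accumulated collection time reported by the standard `java.lang.management` beans with the process uptime. The sketch below is only an illustration: the class and method names (`GcTimeShare`, `totalGcMillis`, `gcShareOfUptime`) are hypothetical and not part of Druid, and for concurrent collectors the reported time is an approximation that is not strictly pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of Druid: estimates the share of JVM uptime spent in GC.
final class GcTimeShare {
    // Sums the approximate accumulated collection time (ms) over all collectors.
    // getCollectionTime() returns -1 when the value is undefined, so those are skipped.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    // Ratio of accumulated GC time to JVM uptime; 0.0 if uptime is not yet measurable.
    static double gcShareOfUptime() {
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptimeMillis <= 0 ? 0.0 : (double) totalGcMillis() / uptimeMillis;
    }
}
```

Calling `GcTimeShare.gcShareOfUptime()` periodically from inside the process (or reading the same beans over JMX) gives a coarse number to compare between a Peon and an Indexer under the same workload.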

To solve this, I suggest storing all incremental index data (keys and values) off-heap, which will dramatically reduce the number of heap objects.
Please check out my issue (#9967) and PR (#10001), which address exactly this problem.
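
To illustrate the general idea only (this is not the code from the PR, and `OffHeapRowStore` with its fixed timestamp-plus-metric row layout is a hypothetical simplification), the sketch below keeps all rows in a single direct `ByteBuffer`. However many rows are added, the garbage collector sees one buffer object instead of one or more heap objects per row.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch, not Druid code: fixed-width rows of (long timestamp, double metric)
// stored in one direct (off-heap) buffer instead of one heap object per row.
final class OffHeapRowStore {
    private static final int ROW_BYTES = Long.BYTES + Double.BYTES;

    private final ByteBuffer buffer;
    private final int capacity;
    private int size = 0;

    OffHeapRowStore(int capacity) {
        this.capacity = capacity;
        // allocateDirect places the backing memory outside the Java heap.
        this.buffer = ByteBuffer.allocateDirect(Math.multiplyExact(capacity, ROW_BYTES));
    }

    // Appends a row and returns its index.
    int add(long timestamp, double metric) {
        if (size == capacity) {
            throw new IllegalStateException("store is full");
        }
        int offset = size * ROW_BYTES;
        buffer.putLong(offset, timestamp);
        buffer.putDouble(offset + Long.BYTES, metric);
        return size++;
    }

    // Adds delta to the metric of an existing row, in place, without allocating.
    void aggregate(int row, double delta) {
        int offset = metricOffset(row);
        buffer.putDouble(offset, buffer.getDouble(offset) + delta);
    }

    long timestampAt(int row) {
        return buffer.getLong(metricOffset(row) - Long.BYTES);
    }

    double metricAt(int row) {
        return buffer.getDouble(metricOffset(row));
    }

    int size() {
        return size;
    }

    // Validates the row index and returns the byte offset of its metric field.
    private int metricOffset(int row) {
        if (row < 0 || row >= size) {
            throw new IndexOutOfBoundsException("row " + row);
        }
        return row * ROW_BYTES + Long.BYTES;
    }
}
```

A real incremental index also needs variable-length keys, a lookup structure, and concurrency control, which this sketch leaves out; it only shows why moving the data off-heap shrinks the number of long-lived heap objects.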

This solution improves the CPU and RAM utilization of batch ingestion by over 50% in both serial and parallel ingestion modes, and might greatly improve resource utilization and ingestion performance when using the `CliIndexer`.

