Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/05/25 15:03:54 UTC

[GitHub] [druid] gianm edited a comment on issue #11231: Minimize memory utilization in Sinks/Hydrants for native batch ingestion

gianm edited a comment on issue #11231:
URL: https://github.com/apache/druid/issues/11231#issuecomment-844772014

@loquisgon thank you for the well-written proposal.

I think it makes sense to improve batch behavior by leveraging the differences between batch and realtime requirements, so I like the big-picture idea.

About structuring the code: there isn't really any perfect way to do it, I think. Introducing a flag is best for minimizing code duplication, but if there are a lot of differences between the paths, they become tough to track since they're mixed together. So separating the classes seems like a good idea. I'd avoid a common superclass, since in cases where we have done it (IndexMerger, IncrementalIndex) I find the logic really hard to follow. There isn't a clear direction of control: sometimes the subclass calls into the superclass, and sometimes the superclass calls into the subclass. IMO the best approach is a shared "helper" class instead of a shared _superclass_, where control only flows in one direction (the main class calls the helper class; not the other way around).
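To make the one-directional control flow concrete, here is a minimal sketch in Java. Every name in it (`RowBufferHelper`, `BatchAppenderSketch`, `RealtimeAppenderSketch`, `maxRowsInMemory` as a constructor argument) is hypothetical and chosen only for illustration; none of these are Druid's actual classes, and the "batch drops rows after persisting, realtime keeps them queryable" behavior is an assumed simplification of the difference between the two paths.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical shared helper. It knows nothing about its callers and never
// calls back into them, so control only flows from the main class to the helper.
final class RowBufferHelper {
    private final int maxRowsInMemory;
    private final List<String> buffer = new ArrayList<>();

    RowBufferHelper(int maxRowsInMemory) {
        this.maxRowsInMemory = maxRowsInMemory;
    }

    // Buffers a row; returns true once the buffer has reached its limit.
    boolean add(String row) {
        buffer.add(row);
        return buffer.size() >= maxRowsInMemory;
    }

    // Returns a copy of the buffered rows and empties the buffer.
    List<String> drain() {
        List<String> drained = new ArrayList<>(buffer);
        buffer.clear();
        return drained;
    }
}

// Hypothetical batch path (assumption: rows need not stay in memory after a persist).
final class BatchAppenderSketch {
    private final RowBufferHelper helper;
    private int persistedRows = 0;

    BatchAppenderSketch(int maxRowsInMemory) {
        this.helper = new RowBufferHelper(maxRowsInMemory);
    }

    void add(String row) {
        if (helper.add(row)) {
            // Count the rows as persisted, then let them be garbage collected.
            persistedRows += helper.drain().size();
        }
    }

    int persistedRows() {
        return persistedRows;
    }
}

// Hypothetical realtime path (assumption: persisted rows must remain queryable).
final class RealtimeAppenderSketch {
    private final RowBufferHelper helper;
    private final List<String> queryableRows = new ArrayList<>();

    RealtimeAppenderSketch(int maxRowsInMemory) {
        this.helper = new RowBufferHelper(maxRowsInMemory);
    }

    void add(String row) {
        if (helper.add(row)) {
            // Keep a reference to the drained rows so queries can still see them.
            queryableRows.addAll(helper.drain());
        }
    }

    List<String> queryableRows() {
        return queryableRows;
    }
}

final class HelperCompositionDemo {
    public static void main(String[] args) {
        BatchAppenderSketch batch = new BatchAppenderSketch(2);
        RealtimeAppenderSketch realtime = new RealtimeAppenderSketch(2);
        for (String row : new String[] {"a", "b", "c"}) {
            batch.add(row);
            realtime.add(row);
        }
        System.out.println("batch persisted rows: " + batch.persistedRows());
        System.out.println("realtime queryable rows: " + realtime.queryableRows());
    }
}
```

The two appender classes share code through the helper without sharing a superclass, so each path can be read top to bottom on its own, and neither has to know which methods the other overrides.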

About performance: how big in bytes was your 1M row test file? It looks like it took 60–90 minutes to ingest, which seems like a really long time for just 1M rows. I'd expect to be able to do it orders of magnitude faster than that. Did it take a long time because each row is really big, or is it related to the fact that there are a lot of segments? (Another way of asking: how long does it take to ingest the same 1M rows if the timestamps are adjusted to all be the same?) For datasets that worked without error prior to your changes, do your changes have a measurable effect on ingestion speed?
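For a rough sense of scale, 1M rows in 60–90 minutes works out to somewhere around 185 to 278 rows per second. A small sketch of that arithmetic follows; `IngestRate` and `rowsPerSecond` are illustrative names only, and the inputs are just the approximate figures quoted above, not new measurements.

```java
// Back-of-the-envelope throughput for the figures quoted above.
final class IngestRate {
    // Average rows ingested per second, given a row count and a duration in minutes.
    static double rowsPerSecond(long rows, double minutes) {
        return rows / (minutes * 60.0);
    }

    public static void main(String[] args) {
        long rows = 1_000_000L;
        System.out.printf("60 min: %.1f rows/s%n", rowsPerSecond(rows, 60));
        System.out.printf("90 min: %.1f rows/s%n", rowsPerSecond(rows, 90));
    }
}
```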

About future work: would you expect these changes to help with non-dynamic partitioning modes? For example, would these changes affect the pre-shuffle partial segment generation phase? Would you expect them to help? It would be interesting to hear your thoughts about future work in this area.
