Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/05/17 18:41:28 UTC
[GitHub] [druid] loquisgon edited a comment on issue #11231: Minimize memory utilization in Sinks/Hydrants for native batch ingestion

loquisgon edited a comment on issue #11231:
URL: https://github.com/apache/druid/issues/11231#issuecomment-842544860


   @jihoonson I see your point that it is still unclear what needs to be done. I am hesitant to do another pass over the document above because it might muddle things further, so let me instead state my concrete plan briefly and precisely. Analysis of the code, tests, and preliminary coding strongly suggest that keeping the data structures for `Sink` and `FireHydrant` in memory can make ingestion run out of memory. My plan is therefore simple:
   
   1.  After each persist, remove all references to `Sink` and `FireHydrant`, keeping just enough metadata in memory to recover them from disk later as needed (e.g. the directory path for the `Sink`, metadata about the `Sink` such as the number of rows in memory so far, etc.).
   2.  When new data arrives after a persist during the same ingestion of the file, recreate the `Sink` as usual and create a new `FireHydrant`.
   3.  Repeat (1-2) as long as rows from the file are being added.
   4.  At the end of processing all rows for the input file, just before the final `push` happens, recover each `Sink` and its `FireHydrant`s from disk, merge the `FireHydrant`s, and push the `Sink`. Do this for all `Sink`s, one by one.
   5.  Occasionally, when adding a row causes `maxRowsPerSegment` to be hit in the `InputSourceProcessor`, a push will happen at that point as well.
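
To make steps 1-4 concrete, here is a minimal Java sketch of the bookkeeping. It is only an illustration under my own assumptions, not Druid code: `BatchSinkSketch`, `SinkMetadata`, `persistAll`, and `pushAll` are hypothetical names, rows are plain strings without newlines, a persisted hydrant is just a text file, and "merge" is simple concatenation. The push triggered by `maxRowsPerSegment` (step 5) is not modelled.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical stand-in for a batch-only appenderator; not Druid's actual API.
class BatchSinkSketch {
    // The only per-sink state that survives a persist (step 1).
    static final class SinkMetadata {
        final Path directory;   // where this sink's persisted hydrants live
        int persistedHydrants;  // number of hydrant files written so far
        int persistedRows;      // rows already written to disk
        SinkMetadata(Path directory) { this.directory = directory; }
    }

    private final Path baseDir;
    // Rows of the current in-memory hydrant per segment id; emptied on persist.
    private final Map<String, List<String>> inMemory = new HashMap<>();
    private final Map<String, SinkMetadata> metadata = new HashMap<>();

    BatchSinkSketch(Path baseDir) { this.baseDir = baseDir; }

    // Step 2: the in-memory sink and a fresh hydrant are recreated on demand.
    void add(String segmentId, String row) {
        inMemory.computeIfAbsent(segmentId, id -> new ArrayList<>()).add(row);
    }

    // Step 1: write every in-memory hydrant to disk, then drop all references.
    void persistAll() throws IOException {
        for (Map.Entry<String, List<String>> e : inMemory.entrySet()) {
            SinkMetadata meta = metadata.get(e.getKey());
            if (meta == null) {
                meta = new SinkMetadata(Files.createDirectories(baseDir.resolve(e.getKey())));
                metadata.put(e.getKey(), meta);
            }
            Files.write(meta.directory.resolve("hydrant-" + meta.persistedHydrants), e.getValue());
            meta.persistedHydrants++;
            meta.persistedRows += e.getValue().size();
        }
        inMemory.clear();
    }

    int rowsInMemory() {
        return inMemory.values().stream().mapToInt(List::size).sum();
    }

    // Step 4: recover one sink at a time from disk, merge its hydrants, push it.
    void pushAll(BiConsumer<String, List<String>> pusher) throws IOException {
        persistAll(); // flush whatever is still in memory
        for (Map.Entry<String, SinkMetadata> e : metadata.entrySet()) {
            SinkMetadata meta = e.getValue();
            List<String> merged = new ArrayList<>();
            for (int i = 0; i < meta.persistedHydrants; i++) {
                merged.addAll(Files.readAllLines(meta.directory.resolve("hydrant-" + i)));
            }
            pusher.accept(e.getKey(), merged); // only this sink's rows are in memory now
        }
        metadata.clear();
    }
}
```

The point of the sketch is that between persists the heap holds only the small `SinkMetadata` entries, and at push time only one sink is materialized at a time.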
   
   The scope of this proposal is therefore strictly limited to the above, in order to manage risk & complexity while still achieving important value (i.e. drastically reducing the probability of OOM in these cases). Introducing a new `Appenderator` is just common software engineering once we accept that batch ingestion really should have a different code path from the real-time case. I believe that this physical & conceptual separation of the code will open up new critical opportunities (such as not using the `Sink` and `FireHydrant` data structures & layout for intermediate persists in batch, and maybe even introducing pre-sorting), but these future opportunities are out of scope for this proposal. The end result is that when the proposal is implemented and merged, the code will most probably still use previous patterns and data structures that may need to be improved & cleaned up in the future. Again, this is done for agility and incremental value delivery.
   

