Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/05/25 15:08:47 UTC

[GitHub] [druid] gianm removed a comment on issue #11231: Minimize memory utilization in Sinks/Hydrants for native batch ingestion

gianm removed a comment on issue #11231:
URL: https://github.com/apache/druid/issues/11231#issuecomment-847950401


   > I think the reason the ingestion takes that long is that the data is intentionally somewhat pathological (even though it simulates a real case in production). It is a series of events spanning 30 years, with data on every day in between. However, each day holds only on the order of ~100 rows. Thus there will be about ~10,000 segments at the end, all pretty small.
   
   Ah, OK, that makes a bit more sense. I hope this case isn't common, though. Even if ingestion completes, people are going to be in for a rude surprise at query time when faced with the overhead of all these tiny segments.
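   For context (an illustration of the arithmetic, not something from the issue itself): the segment count in a case like this falls out of `segmentGranularity` in the ingestion spec's `granularitySpec`. A sketch for the scenario described:
   
   ```json
   "granularitySpec": {
     "type": "uniform",
     "segmentGranularity": "day",
     "queryGranularity": "none"
   }
   ```
   
   With 30 years of daily data this yields one ~100-row segment per day, i.e. roughly 30 × 365 ≈ 11,000 segments, on the order of the ~10,000 described above. A coarser setting such as `"segmentGranularity": "year"` would collapse the same data into ~30 segments of ~36,500 rows each, which is far kinder at query time.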
   
   > By the way, I agree with you that ingesting a 1M row file in 1.5 hours sounds like too much. This is not because of my changes; since the changes remove work, they can only make things faster. One way to speed it up is to realize that for batch ingestion the intermediate persists don't have to be in the "segment" format. If we did intermediate persists (for batch only) in a different format (maybe a log data structure optimized for appends) and then created the real segment at the final merge/push phase, then I believe things would be way faster.
   
   An excellent idea. There's some work that is sort of in this direction: search for `indexSpecForIntermediatePersists`. It isn't a different format, but it's removing some complexity from the segment format.
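   For illustration, `indexSpecForIntermediatePersists` lives in the `tuningConfig`. A minimal sketch (the task type and field values here are one plausible choice, not defaults) that relaxes encoding for intermediate persists while leaving final segments on the regular `indexSpec`:
   
   ```json
   "tuningConfig": {
     "type": "index_parallel",
     "indexSpecForIntermediatePersists": {
       "bitmap": { "type": "roaring" },
       "dimensionCompression": "uncompressed",
       "metricCompression": "none",
       "longEncoding": "longs"
     }
   }
   ```
   
   The tradeoff is larger intermediate files on disk in exchange for less encoding work at persist time.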
   
   > About future work: I took an experimental approach here. I knew that OOMs were an issue. I decided to take a look at dynamically partitioned ingestion since it is the most basic form. I found these issues, which I believe are orthogonal to other issues. If this proposal is accepted, I plan to implement it, and then apply the same approach to hash & range partitioning. Thoughts?
   
   Well, sure, that's a good methodology, but I was hoping to have a crystal ball that lets us predict what we might find. It's interesting to think about since it might inform how we structure things today.
   
   My crystal ball, hazy as it may be, suggests that the work you're doing should apply to the first phase of hash/range-partitioned ingestion, because in that phase we're also maintaining a bunch of different data destinations at once (corresponding to the expected second-phase tasks). I'm not 100% sure how the code is structured, or whether some additional work is needed to get these improvements to apply, but logically it seems like they would make sense there too.
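   For reference, the hash and range modes discussed here are selected via the `partitionsSpec` in the native batch `tuningConfig`; sketches (parameter values illustrative, dimension name a placeholder):
   
   ```json
   "partitionsSpec": { "type": "hashed", "numShards": 10 }
   ```
   
   ```json
   "partitionsSpec": { "type": "single_dim", "partitionDimension": "someDimension", "targetRowsPerSegment": 5000000 }
   ```
   
   In both, the first phase fans rows out across many partition buckets simultaneously, which is where the same sink/hydrant memory pressure would show up.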


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org