You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2022/02/15 20:53:01 UTC

[GitHub] [druid] gianm commented on issue #12261: Accessing private parts of datasketches

gianm commented on issue #12261:
URL: https://github.com/apache/druid/issues/12261#issuecomment-1040781523

> PLEASE, engage with us and help us understand what it is you are trying to do. If it means adding a public API to accomplish it, we are more than willing to do that.

Please don't take it personally that nobody reached out to you about the original problem. The Druid project is a big place with a lot of dependencies, and not all contributors are aware of how willing those various dependencies' maintainers are to work with us. Your willingness to work with us over the years is very much appreciated, as is the fact that you took the time to write up your concerns about this particular case.

> The memory currently being used by a Union or Sketch is rather complex.

The commit is from #12073, #12022 which are trying to improve memory estimation in the on-heap indexing implementation. The goal of the patch is to decide when to spill an on-heap index to disk. To do this properly, we need a way to estimate the memory usage of the index. The patch works by adding an `aggregateWithSize()` method to on-heap aggregators that (1) does an aggregation and (2) returns the amount of additional memory use caused by that particular aggregation. Each aggregator is associated with a particular row in the index, so we then sum up the memory used by all aggregators in all rows to get the memory footprint of the entire index.

The private field in this case is being used by SketchAggregator to implement the memory-estimation piece of `aggregateWithSize()`.

We shouldn't be using private fields like that, so we should find another way to solve the original problem.

One obvious solution I can think of is to use a simple formula, like something that is based only on the `size` parameter and the number of items added so far. I'm not sure how accurate this would be since I'm not super familiar with how the sketches internally manage their on-heap memory. But maybe it'd be close enough? If we do go this way, an implementation note: at ingestion time, it's likely that a small number of items get added to each sketch. (Like in the 1 to 50 range.) I think some sketches have a special mode they use when the number of added items is low, and whenever that is true, we'd want to take that into account to avoid overcounting memory usage.

@AlexanderSaydakov or @leerho do you have any other, better suggestions? & @kfaraz could you please work with them to implement a better solution when we come up with one?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org