You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2022/02/15 01:54:38 UTC

[GitHub] [druid] AlexanderSaydakov opened a new issue #12261: Accessing private parts of datasketches

AlexanderSaydakov opened a new issue #12261:
URL: https://github.com/apache/druid/issues/12261


   We have noticed this recent change https://github.com/apache/druid/commit/e648b01afba52ee5b980d28b16b636e346a86819
   
   This change introduced a piece of code which bypasses public API of DataSketches and accesses a private part of implementation.
   
   https://github.com/apache/druid/blob/033989eb1d8f4f91268b2d7d4d3dc73af7bf2c3f/extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches/theta/SketchAggregator.java#L60
   
   We believe that it is problematic and can lead to wrong results.
   We have this in our documentation:
   /**
     * Although the gadget object is initially an UpdateSketch, in the context of a Union it is used
     * as a specialized buffer that happens to leverage much of the machinery of an UpdateSketch.
     * However, in this context some of the key invariants of the sketch algorithm are intentionally
     * violated as an optimization. As a result this object can not be considered as an UpdateSketch
     * and should never be exported as an UpdateSketch. It’s internal state is not necessarily
     * finalized and may contain garbage. Also its internal concept of “nominal entries” or “k” can
     * be meaningless. It is private for very good reasons.
     */
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] kfaraz commented on issue #12261: Accessing private parts of datasketches

Posted by GitBox <gi...@apache.org>.
kfaraz commented on issue #12261:
URL: https://github.com/apache/druid/issues/12261#issuecomment-1040597660


   Thanks a lot for pointing this out, @AlexanderSaydakov ! I had missed this while making the changes in the mentioned commit.
   
   What I really intended to do was get a better estimate of the memory used by a `org.apache.datasketches.theta.Union`.
   As the `Union` interface does not expose a `getCurrentBytes()` or similar method, I had to resort to the reflection.
   
   Please advise what would be a better way to achieve the same.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] leerho edited a comment on issue #12261: Accessing private parts of datasketches

Posted by GitBox <gi...@apache.org>.
leerho edited a comment on issue #12261:
URL: https://github.com/apache/druid/issues/12261#issuecomment-1041106476


   I have added a PR, mentioned just above, to directly address this issue.  It will be released with the next Java release which should be relatively soon since we also want to release a new KllDoublesSketch, plus a few other things.  
   
   I do respectfully request that you do not lock your "short-term" solution in a formal release.  
   
   Please note: the getCurrentBytes() method implemented here as well as in your "short-term" solution will only report a different value after the internal gadget goes through a resize when the current internal hash table is full. In between these resize events this method will return the value from the previous resize event.   If you serialize the union via toByteArray() the length of the byte array will exactly be the value returned by this method.  It also represents (approximately) how much RAM the union is using.  
   
   However, if you do `union.toByteArray().getResult()`, the size of the resulting CompactSketch will generally be much smaller as the union has been "pulled back to K" and compacted.  
   
   We don't recommend that you actually serialize the union to either store to disk or to transport to another machine because it is so much larger than the compact sketch you get when you getResult().  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] kfaraz commented on issue #12261: Accessing private parts of datasketches

Posted by GitBox <gi...@apache.org>.
kfaraz commented on issue #12261:
URL: https://github.com/apache/druid/issues/12261#issuecomment-1040597660


   Thanks a lot for pointing this out, @AlexanderSaydakov ! I had missed this while making the changes in the mentioned commit.
   
   What I really intended to do was get a better estimate of the memory used by a `org.apache.datasketches.theta.Union`.
   As the `Union` interface does not expose a `getCurrentBytes()` or similar method, I had to resort to the reflection.
   
   Please advise what would be a better way to achieve the same.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] imply-cheddar commented on issue #12261: Accessing private parts of datasketches

Posted by GitBox <gi...@apache.org>.
imply-cheddar commented on issue #12261:
URL: https://github.com/apache/druid/issues/12261#issuecomment-1041343764


   The value is currently being used to estimate the total size in memory of the onheap Aggregator object, so the semantics that you mentioned are exactly what we want.  It's never used to determine or allocate the number of bytes actually required to store, persist or transport the sketch.
   
   It's also perfectly fine that it doesn't always produce a new value.  That's understood and accounted for.
   
   Thanks for such a quick turnaround on an update to the base library.  As soon as that's release, let us know and we will update the code to remove the reflection.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] leerho commented on issue #12261: Accessing private parts of datasketches

Posted by GitBox <gi...@apache.org>.
leerho commented on issue #12261:
URL: https://github.com/apache/druid/issues/12261#issuecomment-1041106476


   I have added a PR, mentioned just above, to directly address this issue.  It will be released with the next Java release which should be relatively soon since we also want to release a new KllDoublesSketch, plus a few other things.  
   
   I do respectfully request that you do not lock your "short-term" solution in a formal release.  
   
   Please note: the getCurrentBytes() method implemented here as well as in your "short-term" solution will only report a different value after the internal gadget goes through a resize when the current internal hash table is full. In between these resize events this method will return the value from the previous resize event.   If you serialize the union via toByteArray() the length of the byte array will exactly be the value returned by this method.  It also represents (approximately) how much RAM the union is using.  
   
   However, if you do union.toByteArray(union.getResult()), the size will generally be much smaller as the union has been "pulled back to K" and compacted.  
   
   We don't recommend that you actually serialize the union to either store to disk or to transport to another machine because it is so much larger than the compact sketch you get when you getResult().  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] leerho commented on issue #12261: Accessing private parts of datasketches

Posted by GitBox <gi...@apache.org>.
leerho commented on issue #12261:
URL: https://github.com/apache/druid/issues/12261#issuecomment-1040739418


   @kfaraz @AlexanderSaydakov @gianm @cheddar 
   1.  **PLEASE, engage with us and help us understand what it is you are trying to do. If it means adding a public API to accomplish it, we are more than willing to do that.**   
   2.  Whatever you do, never access internal private methods of these sketches.  
   3. We placed a detailed, and explicit warning Javadoc on the private gadget member precisely because we were worried that someone might try to do what you did.  
   [Warning Comment](https://github.com/apache/datasketches-java/blob/master/src/main/java/org/apache/datasketches/theta/UnionImpl.java#L59-L67)
   I don't know how we could have been more clear. 
   
   The memory currently being used by a Union or Sketch is rather complex:   
   
   - Do you mean on-heap or off-heap memory, or both?  
   - The largest chunk of memory is the array of hashes which is configured as a hashTable, where the fraction of that hashTable space consumed by valid hashes can vary enormously based on the input stream, and the exact state of the union state machine.  --- So, are you seeking to understand the size of the total space consumed by the sketch, regardless of what fraction is actively being used at that moment?  Unfortunately, this is dynamic, non-deterministic variable.   And if your application makes use of reset(), it will reset the space used down to a minimum.  
   - How do you intend to use this information?  Are you trying to forecast memory requirements in the future, based on some aggregate statistic across many sketches?  
   
   We have done quite a bit of work on modeling aggregate sketch memory usage where thousands or millions of sketches are concurrent in memory.  Perhaps this may be useful to you.  
   
   We like to use Druid's use of DataSketches as an exemplary model and we point folks to your website and your code base so that they can understand how DataSketches can be integrated properly.  We certainly don't want other platforms mimicking your accessing sketch internal private fields or methods.
   
   No one understands the internals of these sketches better than our DataSketches team, and we are very much interested in making sketches work optimally in the Druid environment.   But we cannot help you if you don't engage with us.
   
   In conclusion, it is my strong recommendation that you revert this commit (before it gets frozen in a release), and work with us to help you find a better, more public solution.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on issue #12261: Accessing private parts of datasketches

Posted by GitBox <gi...@apache.org>.
gianm commented on issue #12261:
URL: https://github.com/apache/druid/issues/12261#issuecomment-1040781523


   > PLEASE, engage with us and help us understand what it is you are trying to do. If it means adding a public API to accomplish it, we are more than willing to do that.
   
   Please don't take it personally that nobody reached out to you about the original problem. The Druid project is a big place with a lot of dependencies, and not all contributors are aware of how willing those various dependencies' maintainers are to work with us. Your willingness to work with us over the years is very much appreciated, as is the fact that you took the time to write up your concerns about this particular case.
   
   > The memory currently being used by a Union or Sketch is rather complex.
   
   The commit is from #12073, #12022 which are trying to improve memory estimation in the on-heap indexing implementation. The goal of the patch is to decide when to spill an on-heap index to disk. To do this properly, we need a way to estimate the memory usage of the index. The patch works by adding an `aggregateWithSize()` method to on-heap aggregators that (1) does an aggregation and (2) returns the amount of additional memory use caused by that particular aggregation. Each aggregator is associated with a particular row in the index, so we then sum up the memory used by all aggregators in all rows to get the memory footprint of the entire index.
   
   The private field in this case is being used by SketchAggregator to implement the memory-estimation piece of `aggregateWithSize()`.
   
   We shouldn't be using private fields like that, so we should find another way to solve the original problem.
   
   One obvious solution I can think of is to use a simple formula, like something that is based only on the `size` parameter and the number of items added so far. I'm not sure how accurate this would be since I'm not super familiar with how the sketches internally manage their on-heap memory. But maybe it'd be close enough? If we do go this way, an implementation note: at ingestion time, it's likely that a small number of items get added to each sketch. (Like in the 1 to 50 range.) I think some sketches have a special mode they use when the number of added items is low, and whenever that is true, we'd want to take that into account to avoid overcounting memory usage.
   
   @AlexanderSaydakov or @leerho do you have any other, better suggestions? & @kfaraz could you please work with them to implement a better solution when we come up with one?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] imply-cheddar commented on issue #12261: Accessing private parts of datasketches

Posted by GitBox <gi...@apache.org>.
imply-cheddar commented on issue #12261:
URL: https://github.com/apache/druid/issues/12261#issuecomment-1040867884


   As gian mentioned, the code is reaching in and grabbing the _gadget as a short-term solution to grab the number of bytes used.  It would absolutely be preferable for the UnionSketch to also expose the same `.getCurrentBytes()` method.
   
   In terms of the implementation, the sketch that was grabbed is only being used to call `.getCurrentBytes()` and is used for no other purpose.  The memory counting structure works with deltas, so if reset was called and the number got smaller, there would then be a negative delta in memory usage and the memory consumption values would be updated accordingly.  
   
   The intention was always to open a ticket to request that the public API be extended.  We wanted to have a concrete example of why we needed the method rather than just asking for a random method, so we wanted the PR to actually exist before doing the request.  In the end, the request back to the datasketches got lost in the shuffle of getting the interface changes worked out, so that's our bad.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org