You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@datasketches.apache.org by GitBox <gi...@apache.org> on 2020/07/23 18:09:07 UTC

[GitHub] [incubator-datasketches-java] leerho commented on issue #326: Question related to union performance

leerho commented on issue #326:
URL: https://github.com/apache/incubator-datasketches-java/issues/326#issuecomment-663154127


   @Cheappie 
   We did a characterization study on this question which resulted in the last plot on the [CpcPerformance](https://datasketches.apache.org/docs/CPC/CpcPerformance.html) page on our website.  And this shows that our current implementation of CPC is faster than HLL, which is faster than Theta for both Java and C++. In addition C++ is faster than Java, as you would expect.  
   
   However, this particular characterization test held the number of sketches to be merged as a constant (32) and varied the total number of uniques equally distributed across the 32 sketches.  There are numerous other ways to construct a merge speed test.  I would suggest constructing a test that most closely resembles how merges are actually performed in your application.
   
   That being said, your particular test includes the time to create the union in addition to the merges and executes only two merges.  To reduce the influence of the union construction I would recommend doing a lot more than just 2 update operations.   
   
   In our back-end systems, a single union might be merging thousands to millions of sketches.  Also, it is not typical for all sketches to have the same cardinality, which we assumed in our test. Your test used two different cardinalities.   In the environments we encounter the sketch cardinalities tend to vary with a power-law distribution where only a few sketches are very large and then millions of sketches have very small cardinalities. Unfortunately, constructing meaningful tests for this kind of distribution is challenging because there are so many variables.  
   
   Nonetheless, the reason this is interesting is the Theta union operation has an "early-stop" feature that should speed up the unioning process considerably as the number of sketches in the merge loop gets large.  This might change the ordering of the performance ranking.  Constructing such a test is something we have been wanting to do but haven't yet gotten around to.  :)
   
   
   
   
   
    


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org