You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@datasketches.apache.org by GitBox <gi...@apache.org> on 2022/01/13 08:56:57 UTC

[GitHub] [datasketches-java] hqx871 opened a new issue #382: HIP result is not stable

hqx871 opened a new issue #382:
URL: https://github.com/apache/datasketches-java/issues/382


   Hi, team. 
   When use the HllSketch, the Historical Inverse Probability or HIP result is not stable because delta is double. Code as follow:
   ```
     void addToHipAccum(final double delta) {
       hipAccum += delta;
     }
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [datasketches-java] leerho commented on issue #382: HIP result is not stable

Posted by GitBox <gi...@apache.org>.
leerho commented on issue #382:
URL: https://github.com/apache/datasketches-java/issues/382#issuecomment-1017794602


   This issue is resolved as I see no further issues from the author.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [datasketches-java] leerho commented on issue #382: HIP result is not stable

Posted by GitBox <gi...@apache.org>.
leerho commented on issue #382:
URL: https://github.com/apache/datasketches-java/issues/382#issuecomment-1013579747


   1. In general, we can make no guarantee that sketches are deterministic.  Certain sketches under certain conditions may be deterministic, but you really have to dig into the details of the algorithm, how it is implemented and other factors.  Even those sketches that may not be deterministic, the overall error behavior should still be within the specified error and confidence level. 
   
   2. Nor can we guarantee that they are _exactly_ order-insensitive.  Some sketches under certain conditions may be order sensitive, but to the extent that they are,  the overall error behavior will still be within the specified error and confidence level.  Thus, because they will still be within the specified error bounds we can say that they are order tolerant.  
   
   3. The HIP estimator with HLL sketches really only applies for a single sketch.  Once you do a merge, the estimator automatically falls back to the composite estimator which has about 25% higher error rate than the HIP estimator.  There are some exceptions to this, but it is deep in the details. So your experiment will likely not qualify as a HIP experiment. (we will test this in a moment)
   
   Sketches are, by definition, approximate and probabilistic.  You should regard any single estimate or result from a sketch as a random draw from a known probability distribution.  Thus, to demand that a every sketch reproduce the same exact result even if it is fed the same stream in the same order makes no sense.   If you wish to test a sketch for its accuracy, then you must compare the results against a probability distribution with a standard deviation specified by the sketch documentation.
   
   Looking at your code I noticed a couple of problems:
   
   1. The sketch update statement: *sketch.update(random.nextInt(cardinality));* will likely have duplicates in the sequence because the periodicity of random integers in the range 0 to 2^12 is not very long and you are asking it to produce 512 unique values.  I noticed this because the 10 results from your test are values around 3820. For a sketch of size 2^15, with 512 unique updates should be in exact mode.  Merging 10 of these together will produce an HLL sketch in estimation mode with results of about 5120. If you really want to feed it random numbers use *sketch.update(random.nextDouble());* where the periodicity of random doubles is very long (~2^64), thus unlikely to produce duplicates. For integers you need to increase the possible range of values to all integers.
   
   2. If you want to measure accuracy, you need to track the exact number of unique values being fed to the sketch. Your code does not do that. Even using the hack above is risky, because you never can be sure that the random number generator didn't emit a duplicate. In our testing we use long-integer sequences that never reset.
   
   3. I also added a print statement to print the count, getEstimate(), and getCompositeEstimate().  This help us see what mode the sketch is in.
   
   For an HLL sketch configured with LgK = 15 and undergoing Union (merge) operations the sketch should have an accuracy of about +/- 1.15% with a confidence level of 95%.  
   
   Making these changes to your program, it produces output similar to the following:
   
   ```
   Count: 512 est: 512.0006497710449, 512.0006497710449
   Count: 512 est: 512.0006497710449, 512.0006497710449
   Count: 512 est: 512.0006497710449, 512.0006497710449
   Count: 512 est: 512.0006497710449, 512.0006497710449
   Count: 512 est: 512.0006497710449, 512.0006497710449
   Count: 512 est: 512.0006497710449, 512.0006497710449
   Count: 512 est: 512.0006497710449, 512.0006497710449
   Count: 512 est: 512.0006497710449, 512.0006497710449
   Count: 512 est: 512.0006497710449, 512.0006497710449
   Average: 5118.041780278979
   [5106.382728447485, 5110.360133852762, 5108.442673913953, 5125.525394459671, 5126.565287003847, 5127.345688954016, 5118.7484203718695, 5117.706201948056, 5134.169804185889, 5105.171469652231]
   ```
   Note that the getEstimate() and getCompositeEstimate() are producing the exact same values.  Since these individual sketches are in exact mode, both estimators will produce the same result. If you tried this experiment feeding each individual sketch about 4,000 values, you should see a difference between the two estimators
   
   Plotting the data from the modified code I generated the following plot.
   
   ![Picture1](https://user-images.githubusercontent.com/12941506/149603503-cbfca7f7-664b-404c-b31d-219ccf5c91d5.png)
   
   The X axis are the 10 sample points, the red and blue lines are +/- 2 Std Deviations from the exact value, and the green line shows the relative error.
   
   The sketch is performing well within the +/- 2 sigma bounds!  
   
   You should be very pleased.
   
   [HipDemo.java.txt](https://github.com/apache/datasketches-java/files/7874023/HipDemo.java.txt)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [datasketches-java] leerho closed issue #382: HIP result is not stable

Posted by GitBox <gi...@apache.org>.
leerho closed issue #382:
URL: https://github.com/apache/datasketches-java/issues/382


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [datasketches-java] leerho commented on issue #382: HIP result is not stable

Posted by GitBox <gi...@apache.org>.
leerho commented on issue #382:
URL: https://github.com/apache/datasketches-java/issues/382#issuecomment-1013795037


   I see a typo in the plot: the vertical axis should be labeled: "% Error Relative to Exact"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [datasketches-java] jmalkin commented on issue #382: HIP result is not stable

Posted by GitBox <gi...@apache.org>.
jmalkin commented on issue #382:
URL: https://github.com/apache/datasketches-java/issues/382#issuecomment-1012758678


   @hqx871 The issue here is that the HIP estimator is order-dependent. It also doesn't apply if merging 2 sketches that have transitioned to estimation mode.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [datasketches-java] hqx871 commented on issue #382: HIP result is not stable

Posted by GitBox <gi...@apache.org>.
hqx871 commented on issue #382:
URL: https://github.com/apache/datasketches-java/issues/382#issuecomment-1012780978


   Thanks. I will try getCompositeEstimate instead 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [datasketches-java] leerho commented on issue #382: HIP result is not stable

Posted by GitBox <gi...@apache.org>.
leerho commented on issue #382:
URL: https://github.com/apache/datasketches-java/issues/382#issuecomment-1012372473


   What do you mean "HIP result is not stable"?  Do you have any evidence that there is a problem?  What exactly is your concern?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [datasketches-java] hqx871 commented on issue #382: HIP result is not stable

Posted by GitBox <gi...@apache.org>.
hqx871 commented on issue #382:
URL: https://github.com/apache/datasketches-java/issues/382#issuecomment-1012719806


   @leerho Thanks for your reply. By HIP result is not stable I mean when HllSketch choose hipAccum or Historical Inverse Probability strategy may get different result if change the unioned sketches order. As the follow code example, I get results:
   
   [3820.0975208182876, 3813.762046000671, 3827.557127714105, 3824.370010995233, 3821.2456442772445, 3817.0347462105, 3825.5043384701926, 3819.158133388752, 3813.826214879161, 3814.82872512458]
   
   
   ```
   import org.apache.datasketches.hll.HllSketch;
   import org.apache.datasketches.hll.TgtHllType;
   import org.apache.datasketches.hll.Union;
   import org.apache.datasketches.memory.Memory;
   
   import java.util.ArrayList;
   import java.util.List;
   import java.util.concurrent.ThreadLocalRandom;
   
   public class HipDemo
   {
     public static void main(String[] args)
     {
       final int lgK = 15;
       final int rowCount = 2 << 8;
       final int cardinality = 2 << (lgK - 3);
       final int sketchNum = 10;
       final List<HllSketch> sketches = new ArrayList<>();
       for (int i = 0; i < sketchNum; i++) {
         sketches.add(generateSketch(lgK, rowCount, cardinality));
       }
   
       List<Double> estimates = new ArrayList<>();
       for (int i = 0; i < 10; i++) {
         //change order caused different estimate result.
         estimates.add(runUnion(randOrder(sketches), lgK));
         //estimates.add(runUnion(sketches, lgK));
       }
       System.out.println(estimates);
     }
   
     public static double runUnion(List<HllSketch> sketches, int lgK)
     {
       Union union = new Union(lgK);
       for (HllSketch sketch : sketches) {
         union.update(sketch);
       }
       HllSketch result = union.getResult(TgtHllType.HLL_8);
       return result.getEstimate();
     }
   
     public static List<HllSketch> randOrder(List<HllSketch> sketches)
     {
       List<HllSketch> part1 = new ArrayList<>();
       List<HllSketch> part2 = new ArrayList<>();
       ThreadLocalRandom random = ThreadLocalRandom.current();
       for (HllSketch sketch : sketches) {
         if (random.nextBoolean()) {
           part1.add(sketch);
         } else {
           part2.add(sketch);
         }
       }
       List<HllSketch> all = new ArrayList<>();
       all.addAll(part1);
       all.addAll(part2);
       return all;
     }
   
     public static HllSketch generateSketch(int lgK, int count, int cardinality)
     {
       ThreadLocalRandom random = ThreadLocalRandom.current();
       HllSketch sketch = new HllSketch(lgK);
       for (int j = 0; j < count; j++) {
         sketch.update(random.nextInt(cardinality));
       }
       byte[] bytes = sketch.toCompactByteArray();
       sketch = HllSketch.wrap(Memory.wrap(bytes));
       return sketch;
     }
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org