Posted to commits@datasketches.apache.org by GitBox <gi...@apache.org> on 2020/07/21 22:49:51 UTC

[GitHub] [incubator-datasketches-java] leerho commented on issue #325: Theta sketch intersection estimation value is greater than source sketch estimation value

leerho commented on issue #325:
URL: https://github.com/apache/incubator-datasketches-java/issues/325#issuecomment-662145895


   @phstudy 
   Thank you for providing sufficient information & code so that we could actually understand and replicate your issue.
   
   Your example is not a bug; in fact, given the two sketches you provided, the results are correct.
   
   The fundamental issue is that you are intersecting two sketches, one estimating 264M uniques and the other 21K uniques, both with lgK = 12 (K = 4096), but the intersection had only **one** retained item!
   
   Intersections can produce much larger errors than normal sketching or unioning, as discussed on our [website](https://datasketches.apache.org/docs/Theta/ThetaAccuracyPlots.html).  The fundamental problem is that the intersection operation reduces your sample size, and in this case it reduced it to a single sample!  You cannot expect reasonable accuracy from only one sample.  In fact, you were lucky: it could have returned an estimate of zero.  Would you have been happier with that?
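
   To make the effect of a single sample concrete, here is a minimal, library-free sketch of the arithmetic. It assumes the standard Theta estimator (retained entries divided by theta) and plugs in the values from the summaries printed at the end of this comment; the class and method names are illustrative and not part of the DataSketches API.

   ```java
   /** Illustrative arithmetic only; these names are not part of the DataSketches API. */
   public class ThetaEstimateCheck {

     /** Theta estimator in estimation mode: each retained entry stands for 1/theta uniques. */
     public static double estimate(int retainedEntries, double theta) {
       return retainedEntries / theta;
     }

     /** Expected number of retained entries when n uniques are each kept with probability theta. */
     public static double expectedRetained(double n, double theta) {
       return n * theta;
     }

     public static void main(String[] args) {
       double theta1 = 1.5503161636074036E-5; // sketch 1, copied from its summary
       double theta2 = 0.19793567670940415;   // sketch 2, copied from its summary
       // The intersection summary reports the smaller of the two thetas.
       double thetaInt = Math.min(theta1, theta2);

       System.out.println("sketch 1 estimate     : " + estimate(4096, theta1));
       System.out.println("sketch 2 estimate     : " + estimate(4096, theta2));
       System.out.println("intersection estimate : " + estimate(1, thetaInt));
       System.out.println("uniques per sample    : " + 1.0 / thetaInt);
       // The true intersection cannot be larger than the smaller source set (about 20.7K),
       // so this is an upper limit on the number of samples to expect on average.
       System.out.println("expected samples (max): "
           + expectedRetained(estimate(4096, theta2), thetaInt));
     }
   }
   ```

   At this theta every retained entry stands for roughly 64.5K uniques, so the only non-zero estimate available from one sample is already larger than the smaller source sketch. The last line comes out well below one expected sample, which is why zero retained entries would not have been surprising.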
   
   When doing intersections it is always a good idea to print out the upper and lower bounds along with the estimate. If you had done that, you would have discovered that the 95% confidence range was (LB, Est, UB) = {1484, 64503, 366548}, which is huge! 
   
   The sketch is telling you that the true value of the intersection is somewhere between 1484 and 366,548!  That is a clue that the sketch is not very confident of the result!
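
   As a rough illustration of how to read those bounds, the snippet below compares the width of the 95% interval to the estimate for the three summaries printed at the end of this comment. With live sketches the three inputs would come from `getEstimate()`, `getLowerBound(2)` and `getUpperBound(2)`; here the numbers are simply copied from the output, and `relativeWidth` is a hypothetical helper name used only for this sketch.

   ```java
   /** Illustrative helper; not part of the DataSketches API. */
   public class BoundsCheck {

     /** Width of the confidence interval expressed as a fraction of the estimate. */
     public static double relativeWidth(double lowerBound, double estimate, double upperBound) {
       return (upperBound - lowerBound) / estimate;
     }

     public static void main(String[] args) {
       // (LB, Est, UB) copied from the printed summaries.
       double[][] rows = {
         {2.5604410472331813E8, 2.6420417306809786E8, 2.726232570287611E8}, // sketch 1
         {20119.498939949455, 20693.591312562978, 21283.962960450415},      // sketch 2
         {1484.0, 64502.97194045358, 366548.2818845309}                     // intersection
       };
       String[] names = {"sketch 1", "sketch 2", "intersection"};
       for (int i = 0; i < rows.length; i++) {
         double w = relativeWidth(rows[i][0], rows[i][1], rows[i][2]);
         System.out.printf("%-12s relative width = %.3f%n", names[i], w);
       }
     }
   }
   ```

   For the two source sketches the interval is roughly 6% of the estimate, while for the intersection it is more than five times the estimate.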
   
   From the sketch you can also print out other information that tells you a great deal about what is going on inside the sketch.  I added these extra print statements to your code (PhstudyTest below) so that you can learn to use these tools.  The output from the modified test (PhstudyTestResults below) reveals a great deal about what your example is doing.  Of course, the first really big clue is the line: 
   
   `   Retained Entries        : 1`
   
   From both the preamble output as well as the sketch.toString() output, I could see that all of the sketches and the intersection are behaving quite normally.  (Note: I did not include the long Base64 strings, since you already have them.)
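
   For anyone who wants to cross-check the preamble fields by hand, here is a small, library-free sketch. It assumes that theta as a double is the theta long divided by `Long.MAX_VALUE`, and that the flag bits are numbered from bit 0 in the order the summary lists them; both are assumptions inferred from the printed output, and the class name is illustrative.

   ```java
   /** Illustrative decoding of values shown in the preamble summary; not library code. */
   public class PreambleCheck {

     /** Assumed flag order, bit 0 first, taken from the order of the printed summary. */
     static final String[] FLAG_NAMES =
         {"BIG_ENDIAN_STORAGE", "READ_ONLY", "EMPTY", "COMPACT", "ORDERED"};

     /** Theta as a fraction of the hash space, assuming thetaLong / Long.MAX_VALUE. */
     public static double thetaAsDouble(long thetaLong) {
       return thetaLong / (double) Long.MAX_VALUE;
     }

     /** True if the given bit of the flags byte is set. */
     public static boolean isSet(int flags, int bit) {
       return (flags & (1 << bit)) != 0;
     }

     public static void main(String[] args) {
       long thetaLong = 0x0000820cc93e3a4dL; // "Theta (long,hex)" of sketch 1
       System.out.println("theta long   : " + thetaLong);
       System.out.println("theta double : " + thetaAsDouble(thetaLong));

       int flags = Integer.parseInt("00011010", 2); // "Flags Field" as printed
       System.out.println("flags        : " + flags);
       for (int bit = 0; bit < FLAG_NAMES.length; bit++) {
         System.out.println("  " + FLAG_NAMES[bit] + " : " + isSet(flags, bit));
       }
     }
   }
   ```

   Under these assumptions the decoded values line up with the summary: the hex and decimal forms of theta agree, and the flags byte marks the sketch as read-only, compact and ordered.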
   
   
   Cheers,
   Lee.
   
   ```
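       // Assumption: the test sits in the library's own theta package, which gives it access to PreambleUtil.
       package org.apache.datasketches.theta;

       // Imports added for completeness; they assume the 1.x (incubating) package layout.
       import static org.apache.datasketches.Util.DEFAULT_UPDATE_SEED;

       import java.util.Base64;

       import org.apache.datasketches.memory.Memory;
       import org.testng.annotations.Test;

       public class PhstudyTest {
         @SuppressWarnings("javadoc")
         @Test
         public void checkPhstudy() {
           // Sketch 1: decode the Base64 image, wrap it, then print the preamble and the summary.
           byte[] sketch1Arr = Base64.getDecoder().decode("<sketch1Base64>");
           PreambleUtil.preambleToString(sketch1Arr); // result unused; the preamble is printed two lines below
           final Memory serializedSketch = Memory.wrap(sketch1Arr);
           Sketch sketch1 = Sketch.wrap(serializedSketch, DEFAULT_UPDATE_SEED);
           println(Sketch.toString(sketch1Arr));
           println(sketch1.toString());

           // Sketch 2: same steps.
           byte[] sketch2Arr = Base64.getDecoder().decode("<sketch2Base64>");
           final Memory serializedSketch2 = Memory.wrap(sketch2Arr);
           Sketch sketch2 = Sketch.wrap(serializedSketch2, DEFAULT_UPDATE_SEED);
           println(Sketch.toString(sketch2Arr));
           println(sketch2.toString());

           // Intersect the two sketches and print the summary of the result.
           Intersection inter = SetOperation.builder().buildIntersection();
           Sketch intSketch = inter.intersect(sketch1, sketch2);
           println(intSketch.toString());
         }

         static void println(Object o) { System.out.println(o.toString()); }
       }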
   ```
   
   ```
   PhstudyTest Results:
   
   ### SKETCH PREAMBLE SUMMARY:
   Byte  0: Preamble Longs       : 3
   Byte  0: ResizeFactor         : X1
   Byte  1: Serialization Version: 3
   Byte  2: Family               : COMPACT
   Byte  3: LgNomLongs           : 0
   Byte  4: LgArrLongs           : 0
   Byte  5: Flags Field          : 00011010, 26
     (Native Byte Order)         : LITTLE_ENDIAN
     BIG_ENDIAN_STORAGE          : false
     READ_ONLY                   : true
     EMPTY                       : false
     COMPACT                     : true
     ORDERED                     : true
     SINGLEITEM  (derived)       : false
   Bytes 6-7  : Seed Hash        : 93cc
   Bytes 8-11 : CurrentCount     : 4096
   Bytes 12-15: P                : 0.0
   Bytes 16-23: Theta (double)   : 1.5503161636074036E-5
                Theta (long)     : 142991427517005
                Theta (long,hex) : 0000820cc93e3a4d
   Preamble Bytes                : 24
   Data Bytes                    : 32768
   TOTAL Sketch Bytes            : 32792
   ### END SKETCH PREAMBLE SUMMARY
   
   
   ### DirectCompactOrderedSketch SUMMARY: 
      Estimate                : 2.6420417306809786E8
      Upper Bound, 95% conf   : 2.726232570287611E8
      Lower Bound, 95% conf   : 2.5604410472331813E8
      Theta (double)          : 1.5503161636074036E-5
      Theta (long)            : 142991427517005
      Theta (long) hex        : 0000820cc93e3a4d
      EstMode?                : true
      Empty?                  : false
      Retained Entries        : 4096
      Seed Hash               : 93cc | 37836
   ### END SKETCH SUMMARY
   
   
   ### SKETCH PREAMBLE SUMMARY:
   Byte  0: Preamble Longs       : 3
   Byte  0: ResizeFactor         : X1
   Byte  1: Serialization Version: 3
   Byte  2: Family               : COMPACT
   Byte  3: LgNomLongs           : 0
   Byte  4: LgArrLongs           : 0
   Byte  5: Flags Field          : 00011010, 26
     (Native Byte Order)         : LITTLE_ENDIAN
     BIG_ENDIAN_STORAGE          : false
     READ_ONLY                   : true
     EMPTY                       : false
     COMPACT                     : true
     ORDERED                     : true
     SINGLEITEM  (derived)       : false
   Bytes 6-7  : Seed Hash        : 93cc
   Bytes 8-11 : CurrentCount     : 4096
   Bytes 12-15: P                : 1.0
   Bytes 16-23: Theta (double)   : 0.19793567670940415
                Theta (long)     : 1825634385657445494
                Theta (long,hex) : 1955f4cd16d9bc76
   Preamble Bytes                : 24
   Data Bytes                    : 32768
   TOTAL Sketch Bytes            : 32792
   ### END SKETCH PREAMBLE SUMMARY
   
   
   ### DirectCompactOrderedSketch SUMMARY: 
      Estimate                : 20693.591312562978
      Upper Bound, 95% conf   : 21283.962960450415
      Lower Bound, 95% conf   : 20119.498939949455
      Theta (double)          : 0.19793567670940415
      Theta (long)            : 1825634385657445494
      Theta (long) hex        : 1955f4cd16d9bc76
      EstMode?                : true
      Empty?                  : false
      Retained Entries        : 4096
      Seed Hash               : 93cc | 37836
   ### END SKETCH SUMMARY
   
   
   ### HeapCompactOrderedSketch SUMMARY: 
      Estimate                : 64502.97194045358
      Upper Bound, 95% conf   : 366548.2818845309
      Lower Bound, 95% conf   : 1484.0
      Theta (double)          : 1.5503161636074036E-5
      Theta (long)            : 142991427517005
      Theta (long) hex        : 0000820cc93e3a4d
      EstMode?                : true
      Empty?                  : false
      Retained Entries        : 1
      Seed Hash               : 93cc | 37836
   ### END SKETCH SUMMARY
   ```
   
   
   
   
   
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org