You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@datasketches.apache.org by "drewdahlke (via GitHub)" <gi...@apache.org> on 2024/02/27 18:53:10 UTC

[I] KLL improperly sensitive to merge ordering (datasketches-java)

drewdahlke opened a new issue, #514:
URL: https://github.com/apache/datasketches-java/issues/514

   The order that sketches get merged shouldn't matter much, yet I have a test showing otherwise. This repro gets through ~47 merges and then bombs out.  I tried this on 5.0.x but initially spotted the issue in a project still on 1.3.x so its nothing new. Commenting out Collections.shuffle makes the test pass. I'm on an intel MBP.
   
   `@Test
     public void kllMergeOrdering() {
       List<KllDoublesSketch> baseline = new ArrayList<>();
       for(int x = 0; x < 90; x++) {
         KllDoublesSketch kll = getUpdatableDirectDoublesSketch(200, 0);
         DoubleStream.iterate(0.0, i -> i + 100.0).limit(100).forEach(d -> kll.update(d));
         baseline.add(kll);
       }
   
       for(int iter = 0; iter < 1000; iter ++) {
         KllDoublesSketch kll = null;
         /******* Commenting out the shuffle makes it work?? **********/
         Collections.shuffle(baseline);
         for(KllDoublesSketch k : baseline) {
           if(kll == null) {
             kll = k;
           } else {
             kll.merge(k);
           }
         }
       }
     }`
   
   outputs
   
   `java.lang.ArrayIndexOutOfBoundsException: Index 3 out of bounds for length 3
   
   	at org.apache.datasketches.kll.KllDoublesHelper.populateDoubleWorkArrays(KllDoublesHelper.java:459)
   	at org.apache.datasketches.kll.KllDoublesHelper.mergeDoubleImpl(KllDoublesHelper.java:159)
   	at org.apache.datasketches.kll.KllDoublesSketch.merge(KllDoublesSketch.java:281)
   	at org.apache.datasketches.kll.KllDirectDoublesSketchTest.kllMergeOrdering(KllDirectDoublesSketchTest.java:67)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   	at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:136)
   	at org.testng.internal.invokers.TestInvoker.invokeMethod(TestInvoker.java:658)
   	at org.testng.internal.invokers.TestInvoker.invokeTestMethod(TestInvoker.java:219)
   	at org.testng.internal.invokers.MethodRunner.runInSequence(MethodRunner.java:50)
   	at org.testng.internal.invokers.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:923)
   	at org.testng.internal.invokers.TestInvoker.invokeTestMethods(TestInvoker.java:192)
   	at org.testng.internal.invokers.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:146)
   	at org.testng.internal.invokers.TestMethodWorker.run(TestMethodWorker.java:128)`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


Re: [I] KLL improperly sensitive to merge ordering (datasketches-java)

Posted by "leerho (via GitHub)" <gi...@apache.org>.
leerho commented on issue #514:
URL: https://github.com/apache/datasketches-java/issues/514#issuecomment-1967564450

   I am looking into this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


Re: [I] KLL improperly sensitive to merge ordering (datasketches-java)

Posted by "leerho (via GitHub)" <gi...@apache.org>.
leerho closed issue #514: KLL improperly sensitive to merge ordering
URL: https://github.com/apache/datasketches-java/issues/514


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


Re: [I] KLL improperly sensitive to merge ordering (datasketches-java)

Posted by "leerho (via GitHub)" <gi...@apache.org>.
leerho commented on issue #514:
URL: https://github.com/apache/datasketches-java/issues/514#issuecomment-1970067490

   @drewdahlke,
   What you are experiencing is not a problem with the sketch.  The problem is that your test code is corrupting your source stream, which you have modeled as an ArrayList<>.  
   
   ### Analysis
   Let me illustrate with some code snippets and test runs and I'll attach the whole test class I am using at the end.  I have made a few changes to your code:
   - To make it more easily modifiable, like putting some of your magic numbers at the top, which also makes the code more transparent.
   - I have renamed some of your variables to make it a bit more clear what their function is. Of these the most important is to distinguish between your source list and the merging target sketch.
   - I have reduced the number of sketches in your source list to 10 (from 90) as it demonstrates your problem easily and makes printing more reasonable. The shuffle is commented out initially.
   - I also moved the declaration of the merge target sketch above the trials loop. You will see why in a few moments.
   - I added a `printNarr(baseline)` to illustrate what is going on inside your source list.
   
   Your slightly modified test code looks like this:
   ```
    @Test
     public void kllMergeOrdering() {
       List<KllDoublesSketch> baseline = new ArrayList<>();
       final int numSk = 10; // was 90
       final int u = 100;
       final int k = 200;
       final int trials = 1000;
       for(int x = 0; x < numSk; x++) {
         KllDoublesSketch src = getUpdatableDirectDoublesSketch(k, 0, u);
         DoubleStream.iterate(0.0, i -> i + 100.0).limit(u).forEach(d -> src.update(d));
         baseline.add(src);
       }
   
       KllDoublesSketch tgt;  //getUpdatableDirectDoublesSketch(k, 0, u * numSk);
       for(int t = 0; t < trials; t ++) {
         print("TRIAL: " + t);
         tgt = null;  //.reset();
         /******* Commenting out the shuffle makes it work?? **********/
         //Collections.shuffle(baseline, rand);
         for (KllDoublesSketch src : baseline) {
           if (tgt == null) {
             tgt = src;
           } else {
             tgt.merge(src);
           }
           printNarr(baseline);
         }
       }
     }
   ```
   If you run this code as is, you will notice that it compiles and runs without errors.  But it is not working correctly!
   The top and bottom of the console printout looks like this:
   ```
   TRIAL: 0: [100, 100, 100, 100, 100, 100, 100, 100, 100, 100]
   : [200, 100, 100, 100, 100, 100, 100, 100, 100, 100]
   : [300, 100, 100, 100, 100, 100, 100, 100, 100, 100]
   : [400, 100, 100, 100, 100, 100, 100, 100, 100, 100]
   : [500, 100, 100, 100, 100, 100, 100, 100, 100, 100]
      ...
   TRIAL: 999: [899200, 100, 100, 100, 100, 100, 100, 100, 100, 100]
   : [899300, 100, 100, 100, 100, 100, 100, 100, 100, 100]
   : [899400, 100, 100, 100, 100, 100, 100, 100, 100, 100]
   : [899500, 100, 100, 100, 100, 100, 100, 100, 100, 100]
   : [899600, 100, 100, 100, 100, 100, 100, 100, 100, 100]
      ...
   ```
   Whoa!  The first sketch of your source list is being modified every iteration of the inner loop! So your source data is not what you think it is!  The culprit is this: 
   ```
           ...
           if (tgt == null) {
             tgt = src;      // <-- this is assigning one of the sketches in the source list to be the target!  Oops!
           } else {
             tgt.merge(src);
           }
           ...
   ```
   The effect of that is that the first sketch in the list is also the target, which gets updated constantly.
   
   Now enable the shuffle by uncommenting it and look at the output:
   ```
   ...
   TRIAL: 59: [1625023502369262100, 210984021615402100, 6473720995753347100, 408267423727914700, 3250047004738498900, 601018097404600, 4086289825189300, 105492010807899100, 816447775525480900, 25943560095700]
   : [1836007523984664200, 210984021615402100, 6473720995753347100, 408267423727914700, 3250047004738498900, 601018097404600, 4086289825189300, 105492010807899100, 816447775525480900, 25943560095700]
   : [8309728519738011300, 210984021615402100, 6473720995753347100, 408267423727914700, 3250047004738498900, 601018097404600, 4086289825189300, 105492010807899100, 816447775525480900, 25943560095700]
   : [8717995943465926000, 210984021615402100, 6473720995753347100, 408267423727914700, 3250047004738498900, 601018097404600, 4086289825189300, 105492010807899100, 816447775525480900, 25943560095700]
   FAILED: kllMergeOrdering
   java.lang.AssertionError: The input count has exceeded the capacity of a long and the capability of this sketch.
   ```
   By trial 59, the all the sketches in the source list have been corrupted by being constantly interchanged with the actual target sketch.  Because the merge target sketch eventually appears in all the slots, the merge sum keep accumulating and essentially multiplying by 10 (the list size), which is effectively exponential growth of N.  Quickly it exceeds the capacity of a long (which is huge), and the test fails because the sketch is not designed to handle numbers this large.   The specific AssertionError you see here is not yet part of a release yet.  Nonetheless, the sketch will crash.  
   
   To fix the above code, do the following:
   - Change the line: `KllDoublesSketch tgt;  //getUpdatableDirectDoublesSketch(k, 0, u * numSk);` 
   to: `KllDoublesSketch tgt = getUpdatableDirectDoublesSketch(k, 0, u * numSk);`.
   - Change the line: `tgt = null;  //.reset();` to `tgt.reset();`.
   
   
   ### Comments
   - Unfortunately, your test isn't even a test of merge order, because all of the sketches in your source list are created to be exactly the same!  So even though you are shuffling the list between trials, the data being fed to the sketch is not changing (assuming the test is written correctly!), so the sketch doesn't see anything different.
   - Especially when writing tests, it is important to follow the K.I.S.S. rule and make your code as simple and transparent as possible. I will try to illustrate this with a rewrite of your code as follows:
   
   ```
     @Test
     public void betterKllMergeOrdering() {
       final int numSk = 10;
       final int u = 100;
       final int k = 200;
       final int trials = 1000;
       final KllDoublesSketch[] srcArr = new KllDoublesSketch[numSk];
       double v = 0.0;
       for(int i = 0; i < numSk; i++) {
         KllDoublesSketch src = getUpdatableDoublesSketch(k, 0);
         for (int j = 0; j < u; j++) { src.update(v += 100.0); }
         srcArr[i] = src;
       }
   
       KllDoublesSketch tgt = getUpdatableDoublesSketch(k, 0);
       for (int t = 1; t <= trials; t ++) {
         println("TRIAL: " + t);
         tgt.reset();
         shuffle(srcArr, rand);
         for (KllDoublesSketch src : srcArr) {
           tgt.merge(src);
           printNarr(srcArr);
         }
         assert tgt.getN() == numSk * u : "Test invariant of Total N has been violated.";
       }
     }
   ```
   
   - Your use of Collections, Lists and Streams is overly complex and abstract for this task and doesn't save any lines of code.   In fact, this simpler version is a few lines shorter.  Using simple for-loops and primitive arrays is not only easier to understand and debug, but it will run faster.  Collections are notoriously slow and use lots of memory.  For example, go look at the source code behind the `Collections.shuffle(...)` It is doing lots of extra data movement by first converting the collection to an array, then doing the shuffle, then moving the array back to the collection.  
   - Your test also doesn't do any real checking (perhaps because you wanted to simplify your example).  Nonetheless, as a simple, but not comprehensive check, I added an assert statement at the end that confirms that the Total N of the target sketch is just the number of sketches in your source array, times the number of values entered into each sketch.  This is a test _invariant_, as it should not change even though the values entered into each sketch may vary.
   - I also substituted the direct sketch for the heap-based sketch in the simplified code.  They both will behave the same, but the heap-based sketch is easier to see what is going on from my IDE.  The choice is up to you.  
   
   I hope this helps.
   
   Cheers,
   Lee.
   
   [JavaIssue514.java.zip](https://github.com/apache/datasketches-java/files/14440708/JavaIssue514.java.zip)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


Re: [I] KLL improperly sensitive to merge ordering (datasketches-java)

Posted by "leerho (via GitHub)" <gi...@apache.org>.
leerho commented on issue #514:
URL: https://github.com/apache/datasketches-java/issues/514#issuecomment-1977797115

   This has been resolved.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


Re: [I] KLL improperly sensitive to merge ordering (datasketches-java)

Posted by "drewdahlke (via GitHub)" <gi...@apache.org>.
drewdahlke commented on issue #514:
URL: https://github.com/apache/datasketches-java/issues/514#issuecomment-1971666692

   Yeah I goofed on the code. The error handling tweak cleaning that up with an AssertionError would have had me doubting my code/data over suspecting there's a bug in the library, nice improvement. Thank you so much for your quick and awesome analysis! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


Re: [I] KLL improperly sensitive to merge ordering (datasketches-java)

Posted by "jmalkin (via GitHub)" <gi...@apache.org>.
jmalkin commented on issue #514:
URL: https://github.com/apache/datasketches-java/issues/514#issuecomment-1967667680

   In the meantime, thank you for the useful error report, including both a reproducing example and version range. It is very much appreciated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org