You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@accumulo.apache.org by "keith-turner (via GitHub)" <gi...@apache.org> on 2023/05/26 19:01:47 UTC

[GitHub] [accumulo] keith-turner commented on a diff in pull request #3424: Minor improvements to BulkImport.estimateSizes()

keith-turner commented on code in PR #3424:
URL: https://github.com/apache/accumulo/pull/3424#discussion_r1207211859


##########
core/src/main/java/org/apache/accumulo/core/clientImpl/bulk/BulkImport.java:
##########
@@ -278,33 +278,34 @@ public static Map<KeyExtent,Long> estimateSizes(AccumuloConfiguration acuConf,
 
     Text row = new Text();
 
-    FileSKVIterator index =
-        FileOperations.getInstance().newIndexReaderBuilder().forFile(dataFile, ns, ns.getConf(), cs)
-            .withTableConfiguration(acuConf).withFileLenCache(fileLenCache).build();
+    List<KeyExtent> sortedKeyExtents = new ArrayList<>(counts.keySet());
 
-    try {
+    try (FileSKVIterator index =
+        FileOperations.getInstance().newIndexReaderBuilder().forFile(dataFile, ns, ns.getConf(), cs)
+            .withTableConfiguration(acuConf).withFileLenCache(fileLenCache).build()) {
       while (index.hasTop()) {
         Key key = index.getTopKey();
         totalIndexEntries++;
         key.getRow(row);
 
-        // TODO this could use a binary search
-        for (Entry<KeyExtent,MLong> entry : counts.entrySet()) {
-          if (entry.getKey().contains(row)) {
-            entry.getValue().l++;
+        int start = 0, end = sortedKeyExtents.size() - 1;

Review Comment:
   The comment may have been referring to leveraging the `counts` TreeMap to do a binary search.   The following example illustrates some concepts that may be used to do this.  I ran it locally and it prints what I would expect.
   
   ```
       static KeyExtent nke(String er, String per) {
           return new KeyExtent(TableId.of("1"), er == null ? null : new Text(er), per == null ? null : new Text(per));
       }
   
       public static void main(String[] args) {
           TreeMap<KeyExtent, String> tm = new TreeMap<>();
   
           tm.put(nke("c",null), "1");
           tm.put(nke("f","c"), "2");
           tm.put(nke("m","f"), "3");
           tm.put(nke("ma","m"), "4");
           tm.put(nke(null,"ma"), "5");
   
           System.out.println("1 "+tm.ceilingEntry(nke("b",null)).getValue());
           System.out.println("1 "+tm.ceilingEntry(nke("c",null)).getValue());
           System.out.println("2 "+tm.ceilingEntry(nke("d",null)).getValue());
           System.out.println("2 "+tm.ceilingEntry(nke("f",null)).getValue());
           System.out.println("3 "+tm.ceilingEntry(nke("g",null)).getValue());
           System.out.println("3 "+tm.ceilingEntry(nke("m",null)).getValue());
           System.out.println("4 "+tm.ceilingEntry(nke("m ",null)).getValue());
           System.out.println("4 "+tm.ceilingEntry(nke("ma",null)).getValue());
           System.out.println("5 "+tm.ceilingEntry(nke("mb",null)).getValue());
           System.out.println("5 "+tm.ceilingEntry(nke(null,null)).getValue());
       }
   ```
   
   Always setting prevEndRow to null for the lookup exent in the example above is something I think is done elsewhere in the Accumulo code, but not sure.  I looked at the comparator for KeyExtent and I think its correct, but it would be good to double check that.
   
   In the example above the extents go from -inf to +inf with no gaps and no overlap of the extent. So that example does not capture the complexity of this code. The existing loop in the code handles gaps and overlaps w/o issue because it checks each extent for overlap.  I don't know the answer to this, but could we use the tailMap (and maybe headMap) functions of TreeMap  to efficiently find the extents that overlap?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org