Posted to notifications@accumulo.apache.org by "DomGarguilo (via GitHub)" <gi...@apache.org> on 2023/05/23 21:12:04 UTC

[GitHub] [accumulo] DomGarguilo opened a new pull request, #3424: Improve performance of BulkImport.estimateSizes()

DomGarguilo opened a new pull request, #3424:
URL: https://github.com/apache/accumulo/pull/3424

   This PR addresses a TODO about using a binary search. In addition, I moved the `FileSKVIterator index` into a try-with-resources block.
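
   For context, the try-with-resources part of the change has a standard shape: the reader is declared in the try header so it is closed automatically even if an exception is thrown partway through. A minimal, self-contained sketch of that pattern (`IndexReader` here is a hypothetical stand-in for `FileSKVIterator`, not an Accumulo class):

   ```
   public class TryWithResourcesSketch {
       // Hypothetical stand-in for FileSKVIterator; any AutoCloseable behaves the same.
       static class IndexReader implements AutoCloseable {
           boolean hasTop() {
               return false;
           }

           @Override
           public void close() {
               System.out.println("index closed");
           }
       }

       public static void main(String[] args) {
           // Declaring the resource in the try header guarantees close() runs
           // when the block exits, normally or via an exception.
           try (IndexReader index = new IndexReader()) {
               while (index.hasTop()) {
                   // consume index entries
               }
           }
       }
   }
   ```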


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] DomGarguilo commented on a diff in pull request #3424: Improve performance of BulkImport.estimateSizes()

Posted by "DomGarguilo (via GitHub)" <gi...@apache.org>.
DomGarguilo commented on code in PR #3424:
URL: https://github.com/apache/accumulo/pull/3424#discussion_r1207199616


##########
core/src/main/java/org/apache/accumulo/core/clientImpl/bulk/BulkImport.java:
##########
@@ -278,33 +278,34 @@ public static Map<KeyExtent,Long> estimateSizes(AccumuloConfiguration acuConf,
 
     Text row = new Text();
 
-    FileSKVIterator index =
-        FileOperations.getInstance().newIndexReaderBuilder().forFile(dataFile, ns, ns.getConf(), cs)
-            .withTableConfiguration(acuConf).withFileLenCache(fileLenCache).build();
+    List<KeyExtent> sortedKeyExtents = new ArrayList<>(counts.keySet());
 
-    try {
+    try (FileSKVIterator index =
+        FileOperations.getInstance().newIndexReaderBuilder().forFile(dataFile, ns, ns.getConf(), cs)
+            .withTableConfiguration(acuConf).withFileLenCache(fileLenCache).build()) {
       while (index.hasTop()) {
         Key key = index.getTopKey();
         totalIndexEntries++;
         key.getRow(row);
 
-        // TODO this could use a binary search
-        for (Entry<KeyExtent,MLong> entry : counts.entrySet()) {
-          if (entry.getKey().contains(row)) {
-            entry.getValue().l++;
+        int start = 0, end = sortedKeyExtents.size() - 1;

Review Comment:
   Yeah, good point. I have reverted the binary search here and removed the comment about it. I kept the change adding the `FileSKVIterator` to a try-with-resources block.



[GitHub] [accumulo] keith-turner commented on a diff in pull request #3424: Minor improvements to BulkImport.estimateSizes()

Posted by "keith-turner (via GitHub)" <gi...@apache.org>.
keith-turner commented on code in PR #3424:
URL: https://github.com/apache/accumulo/pull/3424#discussion_r1207218277


##########
core/src/main/java/org/apache/accumulo/core/clientImpl/bulk/BulkImport.java:
##########
@@ -278,33 +278,34 @@ public static Map<KeyExtent,Long> estimateSizes(AccumuloConfiguration acuConf,
 
     Text row = new Text();
 
-    FileSKVIterator index =
-        FileOperations.getInstance().newIndexReaderBuilder().forFile(dataFile, ns, ns.getConf(), cs)
-            .withTableConfiguration(acuConf).withFileLenCache(fileLenCache).build();
+    List<KeyExtent> sortedKeyExtents = new ArrayList<>(counts.keySet());
 
-    try {
+    try (FileSKVIterator index =
+        FileOperations.getInstance().newIndexReaderBuilder().forFile(dataFile, ns, ns.getConf(), cs)
+            .withTableConfiguration(acuConf).withFileLenCache(fileLenCache).build()) {
       while (index.hasTop()) {
         Key key = index.getTopKey();
         totalIndexEntries++;
         key.getRow(row);
 
-        // TODO this could use a binary search
-        for (Entry<KeyExtent,MLong> entry : counts.entrySet()) {
-          if (entry.getKey().contains(row)) {
-            entry.getValue().l++;
+        int start = 0, end = sortedKeyExtents.size() - 1;

Review Comment:
   > The biggest problem here is you are defeating the entire purpose by creating a new ArrayList.
   
   I don't think creating the ArrayList was necessarily a perf problem, because it was done once outside the loop while the binary search was done many times inside the loop.
   
   @DomGarguilo I unresolved this in case you wanted to look into using tailMap on the TreeMap. We can always resolve this comment if you're not interested in looking into that.
   
   





[GitHub] [accumulo] DomGarguilo commented on a diff in pull request #3424: Improve performance of BulkImport.estimateSizes()

Posted by "DomGarguilo (via GitHub)" <gi...@apache.org>.
DomGarguilo commented on code in PR #3424:
URL: https://github.com/apache/accumulo/pull/3424#discussion_r1204343776


##########
core/src/main/java/org/apache/accumulo/core/clientImpl/bulk/BulkImport.java:
##########
@@ -278,33 +278,34 @@ public static Map<KeyExtent,Long> estimateSizes(AccumuloConfiguration acuConf,
 
     Text row = new Text();
 
-    FileSKVIterator index =
-        FileOperations.getInstance().newIndexReaderBuilder().forFile(dataFile, ns, ns.getConf(), cs)
-            .withTableConfiguration(acuConf).withFileLenCache(fileLenCache).build();
+    List<KeyExtent> sortedKeyExtents = new ArrayList<>(counts.keySet());
 
-    try {
+    try (FileSKVIterator index =
+        FileOperations.getInstance().newIndexReaderBuilder().forFile(dataFile, ns, ns.getConf(), cs)
+            .withTableConfiguration(acuConf).withFileLenCache(fileLenCache).build()) {
       while (index.hasTop()) {
         Key key = index.getTopKey();
         totalIndexEntries++;
         key.getRow(row);
 
-        // TODO this could use a binary search
-        for (Entry<KeyExtent,MLong> entry : counts.entrySet()) {
-          if (entry.getKey().contains(row)) {
-            entry.getValue().l++;
+        int start = 0, end = sortedKeyExtents.size() - 1;

Review Comment:
   In order to use `Collections.binarySearch()` we need to provide a `KeyExtent` to search for, but in this case we only have the row. So unless I'm overlooking something, I don't think `Collections.binarySearch()` will work.
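
   For what it's worth, `Collections.binarySearch()` does have an overload that takes a `Comparator`, so one conceivable workaround is to probe with a synthetic extent built from the row, much like the `ceilingEntry` example later in this thread. A simplified, self-contained sketch, using a toy `Extent` record in place of the real `KeyExtent` (the found candidate still has to be checked for containment, since there may be gaps):

   ```
   import java.util.ArrayList;
   import java.util.Collections;
   import java.util.Comparator;
   import java.util.List;

   public class BinarySearchSketch {
       // Toy stand-in for KeyExtent: covers (prevEndRow, endRow]; a null endRow means +inf.
       record Extent(String prevEndRow, String endRow) {}

       public static void main(String[] args) {
           // Order by endRow, with null (+inf) sorting last.
           Comparator<Extent> byEndRow = Comparator.comparing(Extent::endRow,
               Comparator.nullsLast(Comparator.<String>naturalOrder()));
           List<Extent> sorted = new ArrayList<>(
               List.of(new Extent(null, "c"), new Extent("c", "m"), new Extent("m", null)));
           sorted.sort(byEndRow);

           String row = "g";
           // Probe with a synthetic extent whose endRow is the row we have.
           int i = Collections.binarySearch(sorted, new Extent(null, row), byEndRow);
           if (i < 0) {
               i = -i - 1; // insertion point = first extent with endRow >= row
           }
           // The candidate must still be checked for containment (gaps are possible).
           Extent candidate = sorted.get(i);
           System.out.println(candidate); // prints Extent[prevEndRow=c, endRow=m]
       }
   }
   ```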





[GitHub] [accumulo] cshannon commented on a diff in pull request #3424: Improve performance of BulkImport.estimateSizes()

Posted by "cshannon (via GitHub)" <gi...@apache.org>.
cshannon commented on code in PR #3424:
URL: https://github.com/apache/accumulo/pull/3424#discussion_r1207101071


##########
core/src/main/java/org/apache/accumulo/core/clientImpl/bulk/BulkImport.java:
##########
@@ -278,33 +278,34 @@ public static Map<KeyExtent,Long> estimateSizes(AccumuloConfiguration acuConf,
 
     Text row = new Text();
 
-    FileSKVIterator index =
-        FileOperations.getInstance().newIndexReaderBuilder().forFile(dataFile, ns, ns.getConf(), cs)
-            .withTableConfiguration(acuConf).withFileLenCache(fileLenCache).build();
+    List<KeyExtent> sortedKeyExtents = new ArrayList<>(counts.keySet());
 
-    try {
+    try (FileSKVIterator index =
+        FileOperations.getInstance().newIndexReaderBuilder().forFile(dataFile, ns, ns.getConf(), cs)
+            .withTableConfiguration(acuConf).withFileLenCache(fileLenCache).build()) {
       while (index.hasTop()) {
         Key key = index.getTopKey();
         totalIndexEntries++;
         key.getRow(row);
 
-        // TODO this could use a binary search
-        for (Entry<KeyExtent,MLong> entry : counts.entrySet()) {
-          if (entry.getKey().contains(row)) {
-            entry.getValue().l++;
+        int start = 0, end = sortedKeyExtents.size() - 1;

Review Comment:
   The biggest problem here is that you are defeating the entire purpose by creating a new ArrayList. Creating a new collection copies all the entries, so at that point you are iterating over the entire collection anyway, and then have to iterate again later. So this change actually makes things slower and uses more memory than before.
   





[GitHub] [accumulo] ctubbsii merged pull request #3424: Minor improvements to BulkImport.estimateSizes()

Posted by "ctubbsii (via GitHub)" <gi...@apache.org>.
ctubbsii merged PR #3424:
URL: https://github.com/apache/accumulo/pull/3424




[GitHub] [accumulo] ctubbsii commented on a diff in pull request #3424: Minor improvements to BulkImport.estimateSizes()

Posted by "ctubbsii (via GitHub)" <gi...@apache.org>.
ctubbsii commented on code in PR #3424:
URL: https://github.com/apache/accumulo/pull/3424#discussion_r1207262927


##########
core/src/main/java/org/apache/accumulo/core/clientImpl/bulk/BulkImport.java:
##########
@@ -278,33 +278,34 @@ public static Map<KeyExtent,Long> estimateSizes(AccumuloConfiguration acuConf,
 
     Text row = new Text();
 
-    FileSKVIterator index =
-        FileOperations.getInstance().newIndexReaderBuilder().forFile(dataFile, ns, ns.getConf(), cs)
-            .withTableConfiguration(acuConf).withFileLenCache(fileLenCache).build();
+    List<KeyExtent> sortedKeyExtents = new ArrayList<>(counts.keySet());
 
-    try {
+    try (FileSKVIterator index =
+        FileOperations.getInstance().newIndexReaderBuilder().forFile(dataFile, ns, ns.getConf(), cs)
+            .withTableConfiguration(acuConf).withFileLenCache(fileLenCache).build()) {
       while (index.hasTop()) {
         Key key = index.getTopKey();
         totalIndexEntries++;
         key.getRow(row);
 
-        // TODO this could use a binary search
-        for (Entry<KeyExtent,MLong> entry : counts.entrySet()) {
-          if (entry.getKey().contains(row)) {
-            entry.getValue().l++;
+        int start = 0, end = sortedKeyExtents.size() - 1;

Review Comment:
   Since this is merged, if anything further is done, it should be in a new PR.





[GitHub] [accumulo] EdColeman commented on a diff in pull request #3424: Improve performance of BulkImport.estimateSizes()

Posted by "EdColeman (via GitHub)" <gi...@apache.org>.
EdColeman commented on code in PR #3424:
URL: https://github.com/apache/accumulo/pull/3424#discussion_r1203051788


##########
core/src/main/java/org/apache/accumulo/core/clientImpl/bulk/BulkImport.java:
##########
@@ -278,33 +278,34 @@ public static Map<KeyExtent,Long> estimateSizes(AccumuloConfiguration acuConf,
 
     Text row = new Text();
 
-    FileSKVIterator index =
-        FileOperations.getInstance().newIndexReaderBuilder().forFile(dataFile, ns, ns.getConf(), cs)
-            .withTableConfiguration(acuConf).withFileLenCache(fileLenCache).build();
+    List<KeyExtent> sortedKeyExtents = new ArrayList<>(counts.keySet());
 
-    try {
+    try (FileSKVIterator index =
+        FileOperations.getInstance().newIndexReaderBuilder().forFile(dataFile, ns, ns.getConf(), cs)
+            .withTableConfiguration(acuConf).withFileLenCache(fileLenCache).build()) {
       while (index.hasTop()) {
         Key key = index.getTopKey();
         totalIndexEntries++;
         key.getRow(row);
 
-        // TODO this could use a binary search
-        for (Entry<KeyExtent,MLong> entry : counts.entrySet()) {
-          if (entry.getKey().contains(row)) {
-            entry.getValue().l++;
+        int start = 0, end = sortedKeyExtents.size() - 1;

Review Comment:
   Why not Collections.binarySearch() on the sortedKeyExtents list?





[GitHub] [accumulo] keith-turner commented on a diff in pull request #3424: Minor improvements to BulkImport.estimateSizes()

Posted by "keith-turner (via GitHub)" <gi...@apache.org>.
keith-turner commented on code in PR #3424:
URL: https://github.com/apache/accumulo/pull/3424#discussion_r1207211859


##########
core/src/main/java/org/apache/accumulo/core/clientImpl/bulk/BulkImport.java:
##########
@@ -278,33 +278,34 @@ public static Map<KeyExtent,Long> estimateSizes(AccumuloConfiguration acuConf,
 
     Text row = new Text();
 
-    FileSKVIterator index =
-        FileOperations.getInstance().newIndexReaderBuilder().forFile(dataFile, ns, ns.getConf(), cs)
-            .withTableConfiguration(acuConf).withFileLenCache(fileLenCache).build();
+    List<KeyExtent> sortedKeyExtents = new ArrayList<>(counts.keySet());
 
-    try {
+    try (FileSKVIterator index =
+        FileOperations.getInstance().newIndexReaderBuilder().forFile(dataFile, ns, ns.getConf(), cs)
+            .withTableConfiguration(acuConf).withFileLenCache(fileLenCache).build()) {
       while (index.hasTop()) {
         Key key = index.getTopKey();
         totalIndexEntries++;
         key.getRow(row);
 
-        // TODO this could use a binary search
-        for (Entry<KeyExtent,MLong> entry : counts.entrySet()) {
-          if (entry.getKey().contains(row)) {
-            entry.getValue().l++;
+        int start = 0, end = sortedKeyExtents.size() - 1;

Review Comment:
   The comment may have been referring to leveraging the `counts` TreeMap to do a binary search. The following example illustrates some concepts that may be used to do this. I ran it locally and it prints what I would expect.
   
   ```
   import java.util.TreeMap;
   import org.apache.accumulo.core.data.TableId;
   import org.apache.accumulo.core.dataImpl.KeyExtent;
   import org.apache.hadoop.io.Text;

   // Self-contained example; the wrapper class name is arbitrary.
   public class CeilingEntryExample {
       static KeyExtent nke(String er, String per) {
           return new KeyExtent(TableId.of("1"), er == null ? null : new Text(er),
               per == null ? null : new Text(per));
       }

       public static void main(String[] args) {
           TreeMap<KeyExtent, String> tm = new TreeMap<>();

           // Five extents covering (-inf, c], (c, f], (f, m], (m, ma], (ma, +inf)
           tm.put(nke("c", null), "1");
           tm.put(nke("f", "c"), "2");
           tm.put(nke("m", "f"), "3");
           tm.put(nke("ma", "m"), "4");
           tm.put(nke(null, "ma"), "5");

           // Each lookup probes with a synthetic extent whose prevEndRow is null;
           // the number at the start of each line is the expected value.
           System.out.println("1 " + tm.ceilingEntry(nke("b", null)).getValue());
           System.out.println("1 " + tm.ceilingEntry(nke("c", null)).getValue());
           System.out.println("2 " + tm.ceilingEntry(nke("d", null)).getValue());
           System.out.println("2 " + tm.ceilingEntry(nke("f", null)).getValue());
           System.out.println("3 " + tm.ceilingEntry(nke("g", null)).getValue());
           System.out.println("3 " + tm.ceilingEntry(nke("m", null)).getValue());
           System.out.println("4 " + tm.ceilingEntry(nke("m ", null)).getValue());
           System.out.println("4 " + tm.ceilingEntry(nke("ma", null)).getValue());
           System.out.println("5 " + tm.ceilingEntry(nke("mb", null)).getValue());
           System.out.println("5 " + tm.ceilingEntry(nke(null, null)).getValue());
       }
   }
   ```
   
   Always setting prevEndRow to null for the lookup extent in the example above is something I think is done elsewhere in the Accumulo code, but I'm not sure. I looked at the comparator for KeyExtent and I think it's correct, but it would be good to double-check that.
   
   In the example above the extents go from -inf to +inf with no gaps and no overlap between extents, so that example does not capture the full complexity of this code. The existing loop in the code handles gaps and overlaps without issue because it checks each extent for overlap. I don't know the answer to this, but could we use the tailMap (and maybe headMap) functions of TreeMap to efficiently find the extents that overlap? A rough sketch of that idea follows.
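
   As a self-contained illustration of the tailMap idea, here is a sketch with a toy `Extent` record standing in for `KeyExtent`. The early break is valid only under the simplifying assumption that the extents do not overlap; handling true overlaps would need more care, as noted above:

   ```
   import java.util.Comparator;
   import java.util.Map;
   import java.util.TreeMap;

   public class TailMapSketch {
       // Toy stand-in for KeyExtent: covers (prevEndRow, endRow]; nulls mean -inf/+inf.
       record Extent(String prevEndRow, String endRow) {
           boolean contains(String row) {
               return (prevEndRow == null || prevEndRow.compareTo(row) < 0)
                   && (endRow == null || row.compareTo(endRow) <= 0);
           }
       }

       public static void main(String[] args) {
           Comparator<Extent> byEndRow = Comparator.comparing(Extent::endRow,
               Comparator.nullsLast(Comparator.<String>naturalOrder()));
           TreeMap<Extent, Long> counts = new TreeMap<>(byEndRow);
           counts.put(new Extent(null, "c"), 0L);
           counts.put(new Extent("c", "m"), 0L);
           counts.put(new Extent("m", null), 0L);

           String row = "g";
           // tailMap skips every extent whose endRow sorts before the row;
           // none of those can possibly contain it.
           for (Map.Entry<Extent, Long> e : counts.tailMap(new Extent(null, row), true).entrySet()) {
               if (e.getKey().contains(row)) {
                   e.setValue(e.getValue() + 1);
               } else if (e.getKey().prevEndRow() != null && e.getKey().prevEndRow().compareTo(row) >= 0) {
                   break; // safe only because these extents do not overlap
               }
           }
           System.out.println(counts); // only the (c, m] extent's count is incremented
       }
   }
   ```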


