You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by GitBox <gi...@apache.org> on 2021/03/16 21:33:11 UTC

[GitHub] [kafka] cmccabe opened a new pull request #10334: MINOR: Fix BaseHashTable sizing

cmccabe opened a new pull request #10334:
URL: https://github.com/apache/kafka/pull/10334


   The array backing BaseHashTable is intended to be sized as a power of
   two.  Due to a bug, the initial array size was calculated incorrectly
   in some cases.
   
   Also make the maximum array size the largest possible 31-bit power of
   two.  Previously it was a smaller size, but this was due to a typo.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] cmccabe merged pull request #10334: MINOR: Fix BaseHashTable sizing

Posted by GitBox <gi...@apache.org>.
cmccabe merged pull request #10334:
URL: https://github.com/apache/kafka/pull/10334


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] cmccabe commented on pull request #10334: MINOR: Fix BaseHashTable sizing

Posted by GitBox <gi...@apache.org>.
cmccabe commented on pull request #10334:
URL: https://github.com/apache/kafka/pull/10334#issuecomment-801303357


   Thanks for the reviews!  Will commit in a few hours if there are no more comments.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] jsancio commented on a change in pull request #10334: MINOR: Fix BaseHashTable sizing

Posted by GitBox <gi...@apache.org>.
jsancio commented on a change in pull request #10334:
URL: https://github.com/apache/kafka/pull/10334#discussion_r595640017



##########
File path: metadata/src/main/java/org/apache/kafka/timeline/BaseHashTable.java
##########
@@ -56,12 +58,30 @@
         this.elements = new Object[expectedSizeToCapacity(expectedSize)];
     }
 
+    /**
+     * Calculate the capacity we should provision, given the expected size.
+     *
+     * Our capacity must always be a power of 2, and never less than 2.
+     */
     static int expectedSizeToCapacity(int expectedSize) {
-        if (expectedSize <= 1) {
-            return 2;
+        if (expectedSize >= MAX_CAPACITY / 2) {
+            return MAX_CAPACITY;
+        }
+        return Math.max(MIN_CAPACITY, roundUpToPowerOfTwo(expectedSize * 2));
+    }
+
+    private static int roundUpToPowerOfTwo(int i) {
+        if (i < 0) {
+            return 0;
         }
-        double sizeToFit = expectedSize / MAX_LOAD_FACTOR;
-        return (int) Math.min(MAX_CAPACITY, Math.ceil(Math.log(sizeToFit) / LN_2));
+        i = i - 1;
+        i |= i >> 1;
+        i |= i >> 2;
+        i |= i >> 4;
+        i |= i >> 8;
+        i |= i >> 16;
+        i = i + 1;
+        return i < 0 ? MAX_CAPACITY : i;

Review comment:
       Or https://github.com/google/guava/blob/master/guava/src/com/google/common/math/IntMath.java#L56-L72
   
   It is safe to look as it is Apache License 2.0.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] cmccabe commented on a change in pull request #10334: MINOR: Fix BaseHashTable sizing

Posted by GitBox <gi...@apache.org>.
cmccabe commented on a change in pull request #10334:
URL: https://github.com/apache/kafka/pull/10334#discussion_r596378196



##########
File path: metadata/src/main/java/org/apache/kafka/timeline/BaseHashTable.java
##########
@@ -56,12 +61,27 @@
         this.elements = new Object[expectedSizeToCapacity(expectedSize)];
     }
 
+    /**
+     * Calculate the capacity we should provision, given the expected size.
+     *
+     * Our capacity must always be a power of 2, and never less than 2.
+     */
     static int expectedSizeToCapacity(int expectedSize) {
-        if (expectedSize <= 1) {
-            return 2;
+        if (expectedSize >= MAX_CAPACITY / 2) {
+            return MAX_CAPACITY;
+        }
+        return Math.max(MIN_CAPACITY, roundUpToPowerOfTwo(expectedSize * 2));

Review comment:
       Hash tables are designed to always have some empty slots.  The fraction of slots that is allowed to be full is called the load factor.  See https://programming.guide/hash-table-load-factor-and-capacity.html .  Increasing the load factor will save memory, but lookup time will be degraded.  The standard Java load factor is 0.75, which is what we're trying to achieve here.
   
   Actually, looking at this again, we should technically be using 0.75 in the initial expected sizing, rather than a load factor of 0.5 as it is doing now.  0.5 is more traditional but since the other expansion logic uses 0.75, best to be consistent.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] cmccabe commented on a change in pull request #10334: MINOR: Fix BaseHashTable sizing

Posted by GitBox <gi...@apache.org>.
cmccabe commented on a change in pull request #10334:
URL: https://github.com/apache/kafka/pull/10334#discussion_r596271529



##########
File path: metadata/src/test/java/org/apache/kafka/timeline/BaseHashTableTest.java
##########
@@ -121,4 +121,19 @@ public void testExpansion() {
             assertEquals(Integer.valueOf(i), table.baseRemove(Integer.valueOf(i)));
         }
     }
+
+    @Test
+    public void testExpectedSizeToCapacity() {

Review comment:
       Hmm... I think testing 4 billion values is a bit overkill here :)
   
   I added some more test cases, though... Integer.MIN_VALUE, Integer.MAX_VALUE - 1, etc.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] cmccabe commented on a change in pull request #10334: MINOR: Fix BaseHashTable sizing

Posted by GitBox <gi...@apache.org>.
cmccabe commented on a change in pull request #10334:
URL: https://github.com/apache/kafka/pull/10334#discussion_r596269367



##########
File path: metadata/src/main/java/org/apache/kafka/timeline/BaseHashTable.java
##########
@@ -56,12 +58,30 @@
         this.elements = new Object[expectedSizeToCapacity(expectedSize)];
     }
 
+    /**
+     * Calculate the capacity we should provision, given the expected size.
+     *
+     * Our capacity must always be a power of 2, and never less than 2.
+     */
     static int expectedSizeToCapacity(int expectedSize) {
-        if (expectedSize <= 1) {
-            return 2;
+        if (expectedSize >= MAX_CAPACITY / 2) {
+            return MAX_CAPACITY;
+        }
+        return Math.max(MIN_CAPACITY, roundUpToPowerOfTwo(expectedSize * 2));
+    }
+
+    private static int roundUpToPowerOfTwo(int i) {
+        if (i < 0) {
+            return 0;
         }
-        double sizeToFit = expectedSize / MAX_LOAD_FACTOR;
-        return (int) Math.min(MAX_CAPACITY, Math.ceil(Math.log(sizeToFit) / LN_2));
+        i = i - 1;
+        i |= i >> 1;
+        i |= i >> 2;
+        i |= i >> 4;
+        i |= i >> 8;
+        i |= i >> 16;
+        i = i + 1;
+        return i < 0 ? MAX_CAPACITY : i;

Review comment:
       Fair enough, we can use an implementation based on `numberOfLeadingZeros` :)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] ijuma commented on a change in pull request #10334: MINOR: Fix BaseHashTable sizing

Posted by GitBox <gi...@apache.org>.
ijuma commented on a change in pull request #10334:
URL: https://github.com/apache/kafka/pull/10334#discussion_r595558838



##########
File path: metadata/src/main/java/org/apache/kafka/timeline/BaseHashTable.java
##########
@@ -56,12 +58,30 @@
         this.elements = new Object[expectedSizeToCapacity(expectedSize)];
     }
 
+    /**
+     * Calculate the capacity we should provision, given the expected size.
+     *
+     * Our capacity must always be a power of 2, and never less than 2.
+     */
     static int expectedSizeToCapacity(int expectedSize) {
-        if (expectedSize <= 1) {
-            return 2;
+        if (expectedSize >= MAX_CAPACITY / 2) {
+            return MAX_CAPACITY;
+        }
+        return Math.max(MIN_CAPACITY, roundUpToPowerOfTwo(expectedSize * 2));
+    }
+
+    private static int roundUpToPowerOfTwo(int i) {
+        if (i < 0) {
+            return 0;
         }
-        double sizeToFit = expectedSize / MAX_LOAD_FACTOR;
-        return (int) Math.min(MAX_CAPACITY, Math.ceil(Math.log(sizeToFit) / LN_2));
+        i = i - 1;
+        i |= i >> 1;
+        i |= i >> 2;
+        i |= i >> 4;
+        i |= i >> 8;
+        i |= i >> 16;
+        i = i + 1;
+        return i < 0 ? MAX_CAPACITY : i;

Review comment:
       Can you do something like:
   
   ```java
   static final int tableSizeFor(int cap) {
       int n = -1 >>> Integer.numberOfLeadingZeros(cap - 1);
       return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
   }
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] chia7712 commented on a change in pull request #10334: MINOR: Fix BaseHashTable sizing

Posted by GitBox <gi...@apache.org>.
chia7712 commented on a change in pull request #10334:
URL: https://github.com/apache/kafka/pull/10334#discussion_r596283100



##########
File path: metadata/src/main/java/org/apache/kafka/timeline/BaseHashTable.java
##########
@@ -56,12 +61,27 @@
         this.elements = new Object[expectedSizeToCapacity(expectedSize)];
     }
 
+    /**
+     * Calculate the capacity we should provision, given the expected size.
+     *
+     * Our capacity must always be a power of 2, and never less than 2.
+     */
     static int expectedSizeToCapacity(int expectedSize) {
-        if (expectedSize <= 1) {
-            return 2;
+        if (expectedSize >= MAX_CAPACITY / 2) {
+            return MAX_CAPACITY;
+        }
+        return Math.max(MIN_CAPACITY, roundUpToPowerOfTwo(expectedSize * 2));
+    }
+
+    private static int roundUpToPowerOfTwo(int i) {
+        if (i <= 0) {
+            return 0;
+        } else if (i > MAX_SIGNED_POWER_OF_TWO) {

Review comment:
       Are all negative numbers acceptable? If so, the negative numbers which get positive by overflow (like `-1610612735`) encounter error.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] jsancio commented on a change in pull request #10334: MINOR: Fix BaseHashTable sizing

Posted by GitBox <gi...@apache.org>.
jsancio commented on a change in pull request #10334:
URL: https://github.com/apache/kafka/pull/10334#discussion_r596315922



##########
File path: metadata/src/main/java/org/apache/kafka/timeline/BaseHashTable.java
##########
@@ -56,12 +61,27 @@
         this.elements = new Object[expectedSizeToCapacity(expectedSize)];
     }
 
+    /**
+     * Calculate the capacity we should provision, given the expected size.
+     *
+     * Our capacity must always be a power of 2, and never less than 2.
+     */
     static int expectedSizeToCapacity(int expectedSize) {
-        if (expectedSize <= 1) {
-            return 2;
+        if (expectedSize >= MAX_CAPACITY / 2) {
+            return MAX_CAPACITY;
+        }
+        return Math.max(MIN_CAPACITY, roundUpToPowerOfTwo(expectedSize * 2));

Review comment:
       @cmccabe Why do we double the `expectedSize`? For example, why is the capacity `8` when the expected size `3`? Shouldn't the capacity be `4` when the expected size is `3`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] cmccabe commented on a change in pull request #10334: MINOR: Fix BaseHashTable sizing

Posted by GitBox <gi...@apache.org>.
cmccabe commented on a change in pull request #10334:
URL: https://github.com/apache/kafka/pull/10334#discussion_r596375185



##########
File path: metadata/src/main/java/org/apache/kafka/timeline/BaseHashTable.java
##########
@@ -56,12 +61,27 @@
         this.elements = new Object[expectedSizeToCapacity(expectedSize)];
     }
 
+    /**
+     * Calculate the capacity we should provision, given the expected size.
+     *
+     * Our capacity must always be a power of 2, and never less than 2.
+     */
     static int expectedSizeToCapacity(int expectedSize) {
-        if (expectedSize <= 1) {
-            return 2;
+        if (expectedSize >= MAX_CAPACITY / 2) {
+            return MAX_CAPACITY;
+        }
+        return Math.max(MIN_CAPACITY, roundUpToPowerOfTwo(expectedSize * 2));
+    }
+
+    private static int roundUpToPowerOfTwo(int i) {
+        if (i <= 0) {
+            return 0;
+        } else if (i > MAX_SIGNED_POWER_OF_TWO) {

Review comment:
       This function just returns 0 if the number is less than or equal to 0, so we don't have to consider that case.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] cmccabe commented on a change in pull request #10334: MINOR: Fix BaseHashTable sizing

Posted by GitBox <gi...@apache.org>.
cmccabe commented on a change in pull request #10334:
URL: https://github.com/apache/kafka/pull/10334#discussion_r597064603



##########
File path: metadata/src/main/java/org/apache/kafka/timeline/BaseHashTable.java
##########
@@ -56,12 +61,27 @@
         this.elements = new Object[expectedSizeToCapacity(expectedSize)];
     }
 
+    /**
+     * Calculate the capacity we should provision, given the expected size.
+     *
+     * Our capacity must always be a power of 2, and never less than 2.
+     */
     static int expectedSizeToCapacity(int expectedSize) {
-        if (expectedSize <= 1) {
-            return 2;
+        if (expectedSize >= MAX_CAPACITY / 2) {
+            return MAX_CAPACITY;
+        }
+        return Math.max(MIN_CAPACITY, roundUpToPowerOfTwo(expectedSize * 2));
+    }
+
+    private static int roundUpToPowerOfTwo(int i) {
+        if (i <= 0) {
+            return 0;
+        } else if (i > MAX_SIGNED_POWER_OF_TWO) {

Review comment:
       Good catch.  Yes, this should be fixed now in the latest logic.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] chia7712 commented on a change in pull request #10334: MINOR: Fix BaseHashTable sizing

Posted by GitBox <gi...@apache.org>.
chia7712 commented on a change in pull request #10334:
URL: https://github.com/apache/kafka/pull/10334#discussion_r596497253



##########
File path: metadata/src/main/java/org/apache/kafka/timeline/BaseHashTable.java
##########
@@ -56,12 +56,29 @@
         this.elements = new Object[expectedSizeToCapacity(expectedSize)];
     }
 
+    /**
+     * Calculate the capacity we should provision, given the expected size.
+     *
+     * Our capacity must always be a power of 2, and never less than 2 or more
+     * than MAX_CAPACITY.  We use 64-bit numbers here to avoid overflow
+     * concerns.
+     */
+    @SuppressWarnings("unchecked")

Review comment:
       Is this necessary? my IDE says it is redundant.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] chia7712 commented on a change in pull request #10334: MINOR: Fix BaseHashTable sizing

Posted by GitBox <gi...@apache.org>.
chia7712 commented on a change in pull request #10334:
URL: https://github.com/apache/kafka/pull/10334#discussion_r596496496



##########
File path: metadata/src/main/java/org/apache/kafka/timeline/BaseHashTable.java
##########
@@ -56,12 +61,27 @@
         this.elements = new Object[expectedSizeToCapacity(expectedSize)];
     }
 
+    /**
+     * Calculate the capacity we should provision, given the expected size.
+     *
+     * Our capacity must always be a power of 2, and never less than 2.
+     */
     static int expectedSizeToCapacity(int expectedSize) {
-        if (expectedSize <= 1) {
-            return 2;
+        if (expectedSize >= MAX_CAPACITY / 2) {
+            return MAX_CAPACITY;
+        }
+        return Math.max(MIN_CAPACITY, roundUpToPowerOfTwo(expectedSize * 2));
+    }
+
+    private static int roundUpToPowerOfTwo(int i) {
+        if (i <= 0) {
+            return 0;
+        } else if (i > MAX_SIGNED_POWER_OF_TWO) {

Review comment:
       Sorry for my unclear comment.
   
   I assumed that all negative numbers are acceptable for `expectedSizeToCapacity` because of test case `assertEquals(2, BaseHashTable.expectedSizeToCapacity(-123));`. However, some magic numbers (for example, `-1610612735`) can produce `ArithmeticException`.
   
   At any rate, above case are nonexistent in latest commit now.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] chia7712 commented on a change in pull request #10334: MINOR: Fix BaseHashTable sizing

Posted by GitBox <gi...@apache.org>.
chia7712 commented on a change in pull request #10334:
URL: https://github.com/apache/kafka/pull/10334#discussion_r595891567



##########
File path: metadata/src/test/java/org/apache/kafka/timeline/BaseHashTableTest.java
##########
@@ -121,4 +121,19 @@ public void testExpansion() {
             assertEquals(Integer.valueOf(i), table.baseRemove(Integer.valueOf(i)));
         }
     }
+
+    @Test
+    public void testExpectedSizeToCapacity() {

Review comment:
       As it was a bug, maybe we should check all available numbers. For example:
   ```java
           IntStream.range(Integer.MIN_VALUE, Integer.MAX_VALUE).forEach(i -> {
               int result = BaseHashTable.expectedSizeToCapacity(i);
               assertTrue(result >= 2);
               assertTrue(result <= BaseHashTable.MAX_CAPACITY);
               assertEquals(0, result % 2);
           });
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org