You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/19 05:23:54 UTC

[GitHub] [spark] WangGuangxin opened a new pull request, #38722: [SPARK-41200][CORE] BytesToBytesMap's longArray size can be up to MAX_CAPACITY

WangGuangxin opened a new pull request, #38722:
URL: https://github.com/apache/spark/pull/38722

   ### What changes were proposed in this pull request?
   In BytesToBytesMap, the longArray size can be up to `MAX_CAPACITY` instead `MAX_CAPACITY/2` since `MAX_CAPACITY` already take `two array entries per key` into account.
   
   ### Why are the changes needed?
   To handle larger dataset in BytestoBytesMap
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Existings UT
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] WangGuangxin closed pull request #38722: [SPARK-41200][CORE] BytesToBytesMap's longArray size can be up to MAX_CAPACITY

Posted by GitBox <gi...@apache.org>.

WangGuangxin closed pull request #38722: [SPARK-41200][CORE] BytesToBytesMap's longArray size can be up to MAX_CAPACITY
URL: https://github.com/apache/spark/pull/38722


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] WangGuangxin commented on pull request #38722: [SPARK-41200][CORE] BytesToBytesMap's longArray size can be up to MAX_CAPACITY

Posted by GitBox <gi...@apache.org>.

WangGuangxin commented on PR #38722:
URL: https://github.com/apache/spark/pull/38722#issuecomment-1320866342

   cc @JoshRosen @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] WangGuangxin commented on a diff in pull request #38722: [SPARK-41200][CORE] BytesToBytesMap's longArray size can be up to MAX_CAPACITY

Posted by GitBox <gi...@apache.org>.

WangGuangxin commented on code in PR #38722:
URL: https://github.com/apache/spark/pull/38722#discussion_r1027903483


##########
core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java:
##########
@@ -812,9 +812,7 @@ public boolean append(Object kbase, long koff, int klen, Object vbase, long voff
 
         // If the map has reached its growth threshold, try to grow it.
         if (numKeys >= growthThreshold) {
-          // We use two array entries per key, so the array size is twice the capacity.
-          // We should compare the current capacity of the array, instead of its size.
-          if (longArray.size() / 2 < MAX_CAPACITY) {
+          if (longArray.size() < MAX_CAPACITY) {

Review Comment:
   yes, I think you are right, I misunderstood this. close it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #38722: [SPARK-41200][CORE] BytesToBytesMap's longArray size can be up to MAX_CAPACITY

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #38722:
URL: https://github.com/apache/spark/pull/38722#discussion_r1027662257


##########
core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java:
##########
@@ -812,9 +812,7 @@ public boolean append(Object kbase, long koff, int klen, Object vbase, long voff
 
         // If the map has reached its growth threshold, try to grow it.
         if (numKeys >= growthThreshold) {
-          // We use two array entries per key, so the array size is twice the capacity.
-          // We should compare the current capacity of the array, instead of its size.
-          if (longArray.size() / 2 < MAX_CAPACITY) {
+          if (longArray.size() < MAX_CAPACITY) {

Review Comment:
   I think we need to read the code of this entire class to understand the capacity. According to the code comment
   ```
     /**
      * The maximum number of keys that BytesToBytesMap supports. The hash table has to be
      * power-of-2-sized and its backing Java array can contain at most (1 &lt;&lt; 30) elements,
      * since that's the largest power-of-2 that's less than Integer.MAX_VALUE. We need two long array
      * entries per key, giving us a maximum capacity of (1 &lt;&lt; 29).
      */
     public static final int MAX_CAPACITY = (1 << 29);
   ```
   
   This is to restrict the number of keys, and `longArray.size() / 2` is the number of keys. The code here should be correct.
   
   I also saw code like this
   ```
   private void allocate(int capacity) {
       assert (capacity >= 0);
       capacity = Math.max((int) Math.min(MAX_CAPACITY, ByteArrayMethods.nextPowerOf2(capacity)), 64);
       assert (capacity <= MAX_CAPACITY);
       longArray = allocateArray(capacity * 2L);
       longArray.zeroOut();
   
       this.growthThreshold = (int) (capacity * loadFactor);
       this.mask = capacity - 1;
     }
   ```
   
   `longArray` is 2x capacity.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #38722: [SPARK-41200][CORE] BytesToBytesMap's longArray size can be up to MAX_CAPACITY

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on PR #38722:
URL: https://github.com/apache/spark/pull/38722#issuecomment-1321183990

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org