You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/09/13 22:04:04 UTC

[GitHub] [spark] viirya commented on a change in pull request #29744: [SPARK-32872][CORE] Prevent BytesToBytesMap at MAX_CAPACITY from exceeding growth threshold

viirya commented on a change in pull request #29744:
URL: https://github.com/apache/spark/pull/29744#discussion_r487582849



##########
File path: core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
##########
@@ -816,6 +816,10 @@ public boolean append(Object kbase, long koff, int klen, Object vbase, long voff
           } catch (SparkOutOfMemoryError oom) {
             canGrowArray = false;
           }
+        } else if (numKeys >= growthThreshold && longArray.size() / 2 >= MAX_CAPACITY) {
+          // The map cannot grow because doing so would cause it to exceed MAX_CAPACITY. Instead, we
+          // prevent the map from accepting any more new elements.
+          canGrowArray = false;

Review comment:
       I think we won't exceed `MAX_CAPACITY` at all. We have some restrictions to prevent it.
   
   The root cause, I think, is once the map reaches `MAX_CAPACITY`, we won't grow the map even the `numKeys` is more than `growthThreshold`. So we could add elements to the map to more than load factor (0.5) of it. The map cannot be used for sorting later in `UnsafeKVExternalSorter`. Although `UnsafeKVExternalSorter` will allocate new array in the case, it could hit OOM and fails the query.
   
   This change favors reusing the array for sorting, but it prevents the map accepting new elements more early, half of the array is not used after the map capacity reaches `MAX_CAPACITY`. But by doing this, we will do spilling early too.
   
   When I was fixing the customer application before, we don't see the OOM even the payload reached `MAX_CAPACITY`. Is it possibly due to the memory setting on your payload?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org