You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@gobblin.apache.org by GitBox <gi...@apache.org> on 2022/10/19 17:28:26 UTC

[GitHub] [gobblin] homatthew commented on a diff in pull request #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

homatthew commented on code in PR #3586:
URL: https://github.com/apache/gobblin/pull/3586#discussion_r999752146


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -474,11 +477,21 @@ public synchronized void requestTargetNumberOfContainers(YarnContainerRequestBun
     for (Map.Entry<String, Integer> entry : yarnContainerRequestBundle.getHelixTagContainerCountMap().entrySet()) {
       String currentHelixTag = entry.getKey();
       int desiredContainerCount = entry.getValue();
+      Resource resourceForHelixTag = yarnContainerRequestBundle.getHelixTagResourceMap().get(currentHelixTag);
+
       // Calculate requested container count based on adding allocated count and outstanding ContainerRequests in Yarn
-      int requestedContainerCount = allocatedContainerCountMap.getOrDefault(currentHelixTag, 0)
-          + getMatchingRequestsCount(yarnContainerRequestBundle.getHelixTagResourceMap().get(currentHelixTag));
-      for(; requestedContainerCount < desiredContainerCount; requestedContainerCount++) {
-        requestContainer(Optional.absent(), yarnContainerRequestBundle.getHelixTagResourceMap().get(currentHelixTag));
+      allocatedContainerCountMap.putIfAbsent(currentHelixTag, new AtomicInteger(0));

Review Comment:
   The main logical change is changing the value type from `Integer` to `AtomicInteger`. This prevents potential desync between our allocatedContainerCountMap (HelixTag -> numCountainers) and the containerMap (containerId -> Container POJO) due to race condition.
   
   I suspect the incorrect value for allocated container map is one of the reasons we saw incorrect behavior with shrink + allocating. This edge case would be generally rare and that's why we only start to see issues after a large number of days and the issue is resolved after a restart. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org