You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@gobblin.apache.org by GitBox <gi...@apache.org> on 2022/10/18 23:13:19 UTC

[GitHub] [gobblin] homatthew opened a new pull request, #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

homatthew opened a new pull request, #3586:
URL: https://github.com/apache/gobblin/pull/3586

   Due to race condition and incorrect API call, the yarn service both allocates too many containers and in rare cases has a discrepancy between the number of containers state due to race conditions.
   
   Dear Gobblin maintainers,
   
   Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
   
   
   ### JIRA
   - [ ] My PR addresses the following [Gobblin JIRA](https://issues.apache.org/jira/browse/GOBBLIN/) issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
       - https://issues.apache.org/jira/browse/GOBBLIN-XXX
   
   
   ### Description
   - [ ] Here are some details about my PR, including screenshots (if applicable):
   
   
   ### Tests
   - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:
   
   
   ### Commits
   - [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
       1. Subject is separated from body by a blank line
       2. Subject is limited to 50 characters
       3. Subject does not end with a period
       4. Subject uses the imperative mood ("add", not "adding")
       5. Body wraps at 72 characters
       6. Body explains "what" and "why", not "how"
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

Posted by GitBox <gi...@apache.org>.
homatthew commented on code in PR #3586:
URL: https://github.com/apache/gobblin/pull/3586#discussion_r999759729


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -845,8 +877,6 @@ public void onContainersAllocated(List<Container> containers) {
               instanceName = null;
             }
           }
-          allocatedContainerCountMap.put(containerHelixTag,

Review Comment:
   I opted for the atomic integer approach because it is easy and I don't need to touch any of the other spaghetti.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

Posted by GitBox <gi...@apache.org>.
homatthew commented on code in PR #3586:
URL: https://github.com/apache/gobblin/pull/3586#discussion_r999759167


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -845,8 +877,6 @@ public void onContainersAllocated(List<Container> containers) {
               instanceName = null;
             }
           }
-          allocatedContainerCountMap.put(containerHelixTag,

Review Comment:
   Putting this block inside of the synchronized block works is kind of confusing. I am not entirely convinced it does what we want it to. (Does synchronized sync on the `AmrmClient` or does it sync on the `YarnService`. I think it would be the former which is incorrect.
   
   With this being a relatively cold part of the code (once every 90 seconds?), we really don't need any fine-grained locking. An atomic integer would be sufficient or even a global lock for all instances of the class. (YarnService is a single thread and then there is a callback for each container allocated. Container allocation / deallocation doesn't happen that much especially after the pipeline is up for a few hours. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] hanghangliu commented on pull request #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

Posted by GitBox <gi...@apache.org>.
hanghangliu commented on PR #3586:
URL: https://github.com/apache/gobblin/pull/3586#issuecomment-1287510155

   +1, LGTM once you can resolve Zihan's comments 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

Posted by GitBox <gi...@apache.org>.
homatthew commented on code in PR #3586:
URL: https://github.com/apache/gobblin/pull/3586#discussion_r999770296


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -789,11 +815,17 @@ private ImmutableMap.Builder<String, String> buildContainerStatusEventMetadata(C
 
   /**
    * Get the number of matching container requests for the specified resource memory and cores.
+   * Due to YARN-1902, this API is not 100% accurate. However, {@link AMRMClientCallbackHandler#onContainersAllocated(List)}
+   * contains logic for best effort clean up of duplicate requests, and in practice is pretty accurate.

Review Comment:
   This comment is in reference to line `892`. The accuracy of this API request is actually very subtle due to [`Yarn-1902`](https://issues.apache.org/jira/browse/YARN-1902). The workaround is best effort but in my testing was surprisingly accurate. 
   
   While testing, I deleted the if statement code block near `892` and the number of container requests exploded. See the details in the ticket if anyone is curious



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] hanghangliu commented on a diff in pull request #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

Posted by GitBox <gi...@apache.org>.
hanghangliu commented on code in PR #3586:
URL: https://github.com/apache/gobblin/pull/3586#discussion_r1001137288


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -845,8 +877,6 @@ public void onContainersAllocated(List<Container> containers) {
               instanceName = null;
             }
           }
-          allocatedContainerCountMap.put(containerHelixTag,

Review Comment:
   Yes, so your change should be better as it add protection when decreasing the container. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

Posted by GitBox <gi...@apache.org>.
homatthew commented on code in PR #3586:
URL: https://github.com/apache/gobblin/pull/3586#discussion_r1001045933


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -857,6 +887,8 @@ public void onContainersAllocated(List<Container> containers) {
 
         ContainerInfo containerInfo = new ContainerInfo(container, instanceName, containerHelixTag);
         containerMap.put(container.getId(), containerInfo);
+        allocatedContainerCountMap.putIfAbsent(containerHelixTag, new AtomicInteger(0));

Review Comment:
   Oh actually. I think this code is needed because we have code for requesting initial containers. And that does not go through the line 483. I think that is separate



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

Posted by GitBox <gi...@apache.org>.
homatthew commented on code in PR #3586:
URL: https://github.com/apache/gobblin/pull/3586#discussion_r1001225111


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnAutoScalingManager.java:
##########
@@ -266,7 +266,7 @@ void runInternal() {
       }
       slidingWindowReservoir.add(yarnContainerRequestBundle);
 
-      log.debug("There are {} containers being requested in total, tag-count map {}, tag-resource map {}",
+      log.info("There are {} containers being requested in total, tag-count map {}, tag-resource map {}",

Review Comment:
   I'll remove this change and set back to debug. I don't see much value in the tag-resource map actually



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

Posted by GitBox <gi...@apache.org>.
homatthew commented on code in PR #3586:
URL: https://github.com/apache/gobblin/pull/3586#discussion_r1001083931


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -845,8 +877,6 @@ public void onContainersAllocated(List<Container> containers) {
               instanceName = null;
             }
           }
-          allocatedContainerCountMap.put(containerHelixTag,

Review Comment:
   Another approach is to just add synchronization in `onContainersCompleted`. It doesn't make sense to me that handleCompletion method has no concurrency safe guards and can freely modify the inUsedInstances set while this method is iterating over the set. 
   
   I am hesitant to change too much here because "it works". But at the same time I'd love to make the usage of concurrency primitives consistent so it's not so confusing. 🤷 
   
   Even in the original PR there is a question about usage of explicit lock vs intrinsic lock https://github.com/apache/gobblin/pull/3059



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

Posted by GitBox <gi...@apache.org>.
homatthew commented on code in PR #3586:
URL: https://github.com/apache/gobblin/pull/3586#discussion_r999766182


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -789,11 +815,17 @@ private ImmutableMap.Builder<String, String> buildContainerStatusEventMetadata(C
 
   /**
    * Get the number of matching container requests for the specified resource memory and cores.
+   * Due to YARN-1902, this API is not 100% accurate. However, {@link AMRMClientCallbackHandler#onContainersAllocated(List)}
+   * contains logic for best effort clean up of duplicate requests, and in practice is pretty accurate.
    */
   private int getMatchingRequestsCount(Resource resource) {
     int priorityNum = resourcePriorityMap.getOrDefault(resource.toString(), 0);
     Priority priority = Priority.newInstance(priorityNum);
-    return getAmrmClientAsync().getMatchingRequests(priority, ResourceRequest.ANY, resource).size();
+    List<? extends Collection> outstandingRequests = getAmrmClientAsync().getMatchingRequests(priority, ResourceRequest.ANY, resource);

Review Comment:
   The other logical change is that this API call `getMatchingRequests` returns a `List<? extends Collection>` and we were calculating the number of outstanding requests incorrectly. 
   
   If we just return the size of this, we will always return a small value because the API returns a list of collections, where each collection is the requests partitioned by the node ID. So
   
   The pending requests count should be the sum of all outstanding requests. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

Posted by GitBox <gi...@apache.org>.
homatthew commented on code in PR #3586:
URL: https://github.com/apache/gobblin/pull/3586#discussion_r1001052159


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -845,8 +877,6 @@ public void onContainersAllocated(List<Container> containers) {
               instanceName = null;
             }
           }
-          allocatedContainerCountMap.put(containerHelixTag,

Review Comment:
   I re-read the API and I think you're correct. https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-client/apidocs/org/apache/hadoop/yarn/client/api/async/AMRMClientAsync.CallbackHandler.html
   
   Because this method is triggered by heartbeat and this method is called after the `onContainersCompleted`, maybe any implementation would work here since it's not even concurrent lol. All writes to this map actually seem single threaded 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] codecov-commenter commented on pull request #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #3586:
URL: https://github.com/apache/gobblin/pull/3586#issuecomment-1283146924

   # [Codecov](https://codecov.io/gh/apache/gobblin/pull/3586?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#3586](https://codecov.io/gh/apache/gobblin/pull/3586?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (9982cc4) into [master](https://codecov.io/gh/apache/gobblin/commit/52d91635515cb9056a616ea237f4276121c89b64?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (52d9163) will **increase** coverage by `3.91%`.
   > The diff coverage is `5.71%`.
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #3586      +/-   ##
   ============================================
   + Coverage     46.87%   50.79%   +3.91%     
   + Complexity    10660     5806    -4854     
   ============================================
     Files          2117     1061    -1056     
     Lines         82964    40341   -42623     
     Branches       9242     4517    -4725     
   ============================================
   - Hits          38889    20490   -18399     
   + Misses        40513    18086   -22427     
   + Partials       3562     1765    -1797     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/gobblin/pull/3586?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...n/java/org/apache/gobblin/yarn/YarnHelixUtils.java](https://codecov.io/gh/apache/gobblin/pull/3586/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi15YXJuL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3lhcm4vWWFybkhlbGl4VXRpbHMuamF2YQ==) | `23.25% <0.00%> (-0.28%)` | :arrow_down: |
   | [...main/java/org/apache/gobblin/yarn/YarnService.java](https://codecov.io/gh/apache/gobblin/pull/3586/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi15YXJuL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3lhcm4vWWFyblNlcnZpY2UuamF2YQ==) | `15.03% <3.12%> (-0.62%)` | :arrow_down: |
   | [...rg/apache/gobblin/yarn/YarnAutoScalingManager.java](https://codecov.io/gh/apache/gobblin/pull/3586/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi15YXJuL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3lhcm4vWWFybkF1dG9TY2FsaW5nTWFuYWdlci5qYXZh) | `58.06% <100.00%> (ø)` | |
   | [.../apache/gobblin/runtime/api/JobExecutionState.java](https://codecov.io/gh/apache/gobblin/pull/3586/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi1ydW50aW1lL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3J1bnRpbWUvYXBpL0pvYkV4ZWN1dGlvblN0YXRlLmphdmE=) | `79.43% <0.00%> (-0.94%)` | :arrow_down: |
   | [...n/converter/avro/JsonElementConversionFactory.java](https://codecov.io/gh/apache/gobblin/pull/3586/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi1jb3JlL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NvbnZlcnRlci9hdnJvL0pzb25FbGVtZW50Q29udmVyc2lvbkZhY3RvcnkuamF2YQ==) | `80.97% <0.00%> (-0.08%)` | :arrow_down: |
   | [...blin/converter/filter/AvroFieldsPickConverter.java](https://codecov.io/gh/apache/gobblin/pull/3586/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi1jb3JlL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NvbnZlcnRlci9maWx0ZXIvQXZyb0ZpZWxkc1BpY2tDb252ZXJ0ZXIuamF2YQ==) | `85.89% <0.00%> (ø)` | |
   | [...bblin/converter/filter/AvroSchemaFieldRemover.java](https://codecov.io/gh/apache/gobblin/pull/3586/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi1jb3JlLWJhc2Uvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2dvYmJsaW4vY29udmVydGVyL2ZpbHRlci9BdnJvU2NoZW1hRmllbGRSZW1vdmVyLmphdmE=) | `94.64% <0.00%> (ø)` | |
   | [...ement/conversion/hive/utils/AvroHiveTypeUtils.java](https://codecov.io/gh/apache/gobblin/pull/3586/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi1kYXRhLW1hbmFnZW1lbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2dvYmJsaW4vZGF0YS9tYW5hZ2VtZW50L2NvbnZlcnNpb24vaGl2ZS91dGlscy9BdnJvSGl2ZVR5cGVVdGlscy5qYXZh) | `66.95% <0.00%> (ø)` | |
   | [...nversion/hive/query/HiveAvroORCQueryGenerator.java](https://codecov.io/gh/apache/gobblin/pull/3586/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi1kYXRhLW1hbmFnZW1lbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2dvYmJsaW4vZGF0YS9tYW5hZ2VtZW50L2NvbnZlcnNpb24vaGl2ZS9xdWVyeS9IaXZlQXZyb09SQ1F1ZXJ5R2VuZXJhdG9yLmphdmE=) | `65.89% <0.00%> (ø)` | |
   | [...service/modules/orchestration/FSDagStateStore.java](https://codecov.io/gh/apache/gobblin/pull/3586/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi1zZXJ2aWNlL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3NlcnZpY2UvbW9kdWxlcy9vcmNoZXN0cmF0aW9uL0ZTRGFnU3RhdGVTdG9yZS5qYXZh) | | |
   | ... and [1059 more](https://codecov.io/gh/apache/gobblin/pull/3586/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   :mega: We’re building smart automated test selection to slash your CI/CD build times. [Learn more](https://about.codecov.io/iterative-testing/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

Posted by GitBox <gi...@apache.org>.
homatthew commented on code in PR #3586:
URL: https://github.com/apache/gobblin/pull/3586#discussion_r999761558


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -474,11 +477,21 @@ public synchronized void requestTargetNumberOfContainers(YarnContainerRequestBun
     for (Map.Entry<String, Integer> entry : yarnContainerRequestBundle.getHelixTagContainerCountMap().entrySet()) {
       String currentHelixTag = entry.getKey();
       int desiredContainerCount = entry.getValue();
+      Resource resourceForHelixTag = yarnContainerRequestBundle.getHelixTagResourceMap().get(currentHelixTag);
+
       // Calculate requested container count based on adding allocated count and outstanding ContainerRequests in Yarn
-      int requestedContainerCount = allocatedContainerCountMap.getOrDefault(currentHelixTag, 0)
-          + getMatchingRequestsCount(yarnContainerRequestBundle.getHelixTagResourceMap().get(currentHelixTag));
-      for(; requestedContainerCount < desiredContainerCount; requestedContainerCount++) {
-        requestContainer(Optional.absent(), yarnContainerRequestBundle.getHelixTagResourceMap().get(currentHelixTag));
+      allocatedContainerCountMap.putIfAbsent(currentHelixTag, new AtomicInteger(0));

Review Comment:
   The rest of the changes are logging / code smell changes. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

Posted by GitBox <gi...@apache.org>.
homatthew commented on code in PR #3586:
URL: https://github.com/apache/gobblin/pull/3586#discussion_r1001045933


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -857,6 +887,8 @@ public void onContainersAllocated(List<Container> containers) {
 
         ContainerInfo containerInfo = new ContainerInfo(container, instanceName, containerHelixTag);
         containerMap.put(container.getId(), containerInfo);
+        allocatedContainerCountMap.putIfAbsent(containerHelixTag, new AtomicInteger(0));

Review Comment:
   Oh actually. I think this code is needed because we have code for requesting initial containers. And that does not go through the line 483. I think that is separate



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

Posted by GitBox <gi...@apache.org>.
homatthew commented on code in PR #3586:
URL: https://github.com/apache/gobblin/pull/3586#discussion_r999759167


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -845,8 +877,6 @@ public void onContainersAllocated(List<Container> containers) {
               instanceName = null;
             }
           }
-          allocatedContainerCountMap.put(containerHelixTag,

Review Comment:
   Putting this block inside of the synchronized block works is kind of confusing. I am not entirely convinced it does what we want it to. I think it would lock on the callbackhandler which does prevent concurrent modifications if the caller is using this method but not others (see onContainerComplete which has no safeguards)
   
   With this being a relatively cold part of the code (once every 90 seconds?), we really don't need any fine-grained locking. An atomic integer would be sufficient or even a global lock for all instances of the class. (YarnService is a single thread and then there is a callback for each container allocated. Container allocation / deallocation doesn't happen that much especially after the pipeline is up for a few hours. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

Posted by GitBox <gi...@apache.org>.
homatthew commented on code in PR #3586:
URL: https://github.com/apache/gobblin/pull/3586#discussion_r999760735


##########
gobblin-yarn/src/test/java/org/apache/gobblin/yarn/YarnServiceTest.java:
##########
@@ -316,6 +316,7 @@ private static HelixManager getMockHelixManager(Config config) {
       Mockito.when(helixManager.getClusterName()).thenReturn(config.getString(GobblinClusterConfigurationKeys.HELIX_CLUSTER_NAME_KEY));
 
       Mockito.when(helixManager.getHelixDataAccessor()).thenReturn(helixDataAccessor);
+      Mockito.when(helixManager.getMetadataStoreConnectionString()).thenReturn("stub");

Review Comment:
   YarnService test doesn't run without this mock. I accidentally broke this test in https://github.com/apache/gobblin/pull/3561 but this wasn't caught because this test doesn't run in CI due to flakiness and slowness



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] hanghangliu commented on a diff in pull request #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

Posted by GitBox <gi...@apache.org>.
hanghangliu commented on code in PR #3586:
URL: https://github.com/apache/gobblin/pull/3586#discussion_r1000919913


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnAutoScalingManager.java:
##########
@@ -266,7 +266,7 @@ void runInternal() {
       }
       slidingWindowReservoir.add(yarnContainerRequestBundle);
 
-      log.debug("There are {} containers being requested in total, tag-count map {}, tag-resource map {}",
+      log.info("There are {} containers being requested in total, tag-count map {}, tag-resource map {}",

Review Comment:
   This log seems very similar to the log in YarnService.requestTargetNumberOfContainers() "Current tag-container desired count:". Maybe we can just use that one and add  tag-resource map there?



##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -474,11 +477,21 @@ public synchronized void requestTargetNumberOfContainers(YarnContainerRequestBun
     for (Map.Entry<String, Integer> entry : yarnContainerRequestBundle.getHelixTagContainerCountMap().entrySet()) {
       String currentHelixTag = entry.getKey();
       int desiredContainerCount = entry.getValue();
+      Resource resourceForHelixTag = yarnContainerRequestBundle.getHelixTagResourceMap().get(currentHelixTag);
+
       // Calculate requested container count based on adding allocated count and outstanding ContainerRequests in Yarn
-      int requestedContainerCount = allocatedContainerCountMap.getOrDefault(currentHelixTag, 0)
-          + getMatchingRequestsCount(yarnContainerRequestBundle.getHelixTagResourceMap().get(currentHelixTag));
-      for(; requestedContainerCount < desiredContainerCount; requestedContainerCount++) {
-        requestContainer(Optional.absent(), yarnContainerRequestBundle.getHelixTagResourceMap().get(currentHelixTag));
+      allocatedContainerCountMap.putIfAbsent(currentHelixTag, new AtomicInteger(0));
+      int allocatedContainers = allocatedContainerCountMap.get(currentHelixTag).get();
+      int outstandingContainerRequests = getMatchingRequestsCount(resourceForHelixTag);
+      int requestedContainerCount = allocatedContainers + outstandingContainerRequests;
+      int numContainersNeeded = desiredContainerCount - requestedContainerCount;
+      LOGGER.info("helixTag={}, allocatedContainers={}, outstandingContainerRequests={}, desiredContainerCount={}, numContainersNeeded={}",
+          currentHelixTag, allocatedContainers, outstandingContainerRequests, desiredContainerCount, numContainersNeeded);
+
+      if (numContainersNeeded > 0) {
+        requestContainers(numContainersNeeded, resourceForHelixTag);
+      } else {
+        LOGGER.info("Not requesting any containers because numContainersNeeded={} which is not > 0", numContainersNeeded);

Review Comment:
   maybe can set this as debug? As the previous log should already indicate the behavior 



##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -857,6 +887,8 @@ public void onContainersAllocated(List<Container> containers) {
 
         ContainerInfo containerInfo = new ContainerInfo(container, instanceName, containerHelixTag);
         containerMap.put(container.getId(), containerInfo);
+        allocatedContainerCountMap.putIfAbsent(containerHelixTag, new AtomicInteger(0));

Review Comment:
   Not sure if this is still needed, as it should already been added in line 483 inside requestTargetNumberOfContainers. But no harm to add though 



##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -474,11 +477,21 @@ public synchronized void requestTargetNumberOfContainers(YarnContainerRequestBun
     for (Map.Entry<String, Integer> entry : yarnContainerRequestBundle.getHelixTagContainerCountMap().entrySet()) {
       String currentHelixTag = entry.getKey();
       int desiredContainerCount = entry.getValue();
+      Resource resourceForHelixTag = yarnContainerRequestBundle.getHelixTagResourceMap().get(currentHelixTag);
+
       // Calculate requested container count based on adding allocated count and outstanding ContainerRequests in Yarn
-      int requestedContainerCount = allocatedContainerCountMap.getOrDefault(currentHelixTag, 0)
-          + getMatchingRequestsCount(yarnContainerRequestBundle.getHelixTagResourceMap().get(currentHelixTag));
-      for(; requestedContainerCount < desiredContainerCount; requestedContainerCount++) {
-        requestContainer(Optional.absent(), yarnContainerRequestBundle.getHelixTagResourceMap().get(currentHelixTag));
+      allocatedContainerCountMap.putIfAbsent(currentHelixTag, new AtomicInteger(0));
+      int allocatedContainers = allocatedContainerCountMap.get(currentHelixTag).get();
+      int outstandingContainerRequests = getMatchingRequestsCount(resourceForHelixTag);
+      int requestedContainerCount = allocatedContainers + outstandingContainerRequests;

Review Comment:
   allocatedContainers looks confusing as we have another variable numAllocatedContainers. Maybe just directly calculate requestedContainerCount = allocatedContainerCountMap.get(currentHelixTag).get() + getMatchingRequestsCount(resourceForHelixTag); ? Or you have any better idea?



##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -845,8 +877,6 @@ public void onContainersAllocated(List<Container> containers) {
               instanceName = null;
             }
           }
-          allocatedContainerCountMap.put(containerHelixTag,

Review Comment:
   I'm okay with your change, but the original way should also work



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

Posted by GitBox <gi...@apache.org>.
homatthew commented on code in PR #3586:
URL: https://github.com/apache/gobblin/pull/3586#discussion_r1001052159


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -845,8 +877,6 @@ public void onContainersAllocated(List<Container> containers) {
               instanceName = null;
             }
           }
-          allocatedContainerCountMap.put(containerHelixTag,

Review Comment:
   I re-read the API and I think you're correct. https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-client/apidocs/org/apache/hadoop/yarn/client/api/async/AMRMClientAsync.CallbackHandler.html
   
   Because this method is triggered by heartbeat and this method is called after completion, maybe any implementation would work here since it's not even concurrent lol. All writes to this map actually seem single threaded 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] ZihanLi58 commented on a diff in pull request #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

Posted by GitBox <gi...@apache.org>.
ZihanLi58 commented on code in PR #3586:
URL: https://github.com/apache/gobblin/pull/3586#discussion_r1002141059


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -394,8 +397,8 @@ protected void shutDown() throws IOException {
       if (!this.containerMap.isEmpty()) {
         synchronized (this.allContainersStopped) {
           try {
-            // Wait 5 minutes for the containers to stop

Review Comment:
   Keep this comment?



##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -789,11 +815,17 @@ private ImmutableMap.Builder<String, String> buildContainerStatusEventMetadata(C
 
   /**
    * Get the number of matching container requests for the specified resource memory and cores.
+   * Due to YARN-1902, this API is not 100% accurate. However, {@link AMRMClientCallbackHandler#onContainersAllocated(List)}
+   * contains logic for best effort clean up of duplicate requests, and in practice is pretty accurate.
    */
   private int getMatchingRequestsCount(Resource resource) {
     int priorityNum = resourcePriorityMap.getOrDefault(resource.toString(), 0);
     Priority priority = Priority.newInstance(priorityNum);
-    return getAmrmClientAsync().getMatchingRequests(priority, ResourceRequest.ANY, resource).size();
+    List<? extends Collection> outstandingRequests = getAmrmClientAsync().getMatchingRequests(priority, ResourceRequest.ANY, resource);

Review Comment:
   Can we add this context that "API returns a list of collections, where each collection is the requests partitioned by the node ID" into comment?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

Posted by GitBox <gi...@apache.org>.
homatthew commented on code in PR #3586:
URL: https://github.com/apache/gobblin/pull/3586#discussion_r999770296


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -789,11 +815,17 @@ private ImmutableMap.Builder<String, String> buildContainerStatusEventMetadata(C
 
   /**
    * Get the number of matching container requests for the specified resource memory and cores.
+   * Due to YARN-1902, this API is not 100% accurate. However, {@link AMRMClientCallbackHandler#onContainersAllocated(List)}
+   * contains logic for best effort clean up of duplicate requests, and in practice is pretty accurate.

Review Comment:
   This comment is in reference to line `898`. The accuracy of this API request is actually very subtle due to [`Yarn-1902`](https://issues.apache.org/jira/browse/YARN-1902). The workaround is best effort but in my testing was surprisingly accurate. 
   
   While testing, I deleted the if statement code block near `898` and the number of container requests exploded. See the details in the ticket if anyone is curious



##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -789,11 +815,17 @@ private ImmutableMap.Builder<String, String> buildContainerStatusEventMetadata(C
 
   /**
    * Get the number of matching container requests for the specified resource memory and cores.
+   * Due to YARN-1902, this API is not 100% accurate. However, {@link AMRMClientCallbackHandler#onContainersAllocated(List)}
+   * contains logic for best effort clean up of duplicate requests, and in practice is pretty accurate.

Review Comment:
   This comment is in reference to line `898`. The accuracy of this API request is actually very subtle due to [`Yarn-1902`](https://issues.apache.org/jira/browse/YARN-1902). The workaround is best effort but in my testing was surprisingly accurate. 
   
   While testing, I deleted the if statement code block near `898` and the number of container requests exploded. See the details in the ticket if anyone if curious



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

Posted by GitBox <gi...@apache.org>.
homatthew commented on code in PR #3586:
URL: https://github.com/apache/gobblin/pull/3586#discussion_r1003565204


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -889,11 +902,11 @@ public void onContainersAllocated(List<Container> containers) {
         allocatedContainerCountMap.putIfAbsent(containerHelixTag, new AtomicInteger(0));
         allocatedContainerCountMap.get(containerHelixTag).incrementAndGet();
 
-        // Find matching requests and remove the request to reduce the chance that a subsequent request
-        // will request extra containers. YARN does not have a delta request API and the requests are not
-        // cleaned up automatically.
+        // Find matching requests and remove the request (YARN-660). We the scheduler are responsible

Review Comment:
   https://issues.apache.org/jira/browse/YARN-660?focusedCommentId=13655384&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13655384
   
   > This jira takes up the problem of helping schedulers find matching requests for allocated containers.
   > 
   > ...
   > 
   > Now the API becomes elegant and intuitive. addContainerRequest() to add requests. getMatchingRequests() to get all matching requests. Pick a container from matching requests and call removeContainerRequest() with it to remove it. That is all that there is to it. There are some other minor fixes to AMRMClientAsync. We are on the same page wrt this being a needed functionality.
   > 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

Posted by GitBox <gi...@apache.org>.
homatthew commented on code in PR #3586:
URL: https://github.com/apache/gobblin/pull/3586#discussion_r999761558


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -474,11 +477,21 @@ public synchronized void requestTargetNumberOfContainers(YarnContainerRequestBun
     for (Map.Entry<String, Integer> entry : yarnContainerRequestBundle.getHelixTagContainerCountMap().entrySet()) {
       String currentHelixTag = entry.getKey();
       int desiredContainerCount = entry.getValue();
+      Resource resourceForHelixTag = yarnContainerRequestBundle.getHelixTagResourceMap().get(currentHelixTag);
+
       // Calculate requested container count based on adding allocated count and outstanding ContainerRequests in Yarn
-      int requestedContainerCount = allocatedContainerCountMap.getOrDefault(currentHelixTag, 0)
-          + getMatchingRequestsCount(yarnContainerRequestBundle.getHelixTagResourceMap().get(currentHelixTag));
-      for(; requestedContainerCount < desiredContainerCount; requestedContainerCount++) {
-        requestContainer(Optional.absent(), yarnContainerRequestBundle.getHelixTagResourceMap().get(currentHelixTag));
+      allocatedContainerCountMap.putIfAbsent(currentHelixTag, new AtomicInteger(0));

Review Comment:
   Most of the rest of the changes are logging / code smell changes. 



##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -474,11 +477,21 @@ public synchronized void requestTargetNumberOfContainers(YarnContainerRequestBun
     for (Map.Entry<String, Integer> entry : yarnContainerRequestBundle.getHelixTagContainerCountMap().entrySet()) {
       String currentHelixTag = entry.getKey();
       int desiredContainerCount = entry.getValue();
+      Resource resourceForHelixTag = yarnContainerRequestBundle.getHelixTagResourceMap().get(currentHelixTag);
+
       // Calculate requested container count based on adding allocated count and outstanding ContainerRequests in Yarn
-      int requestedContainerCount = allocatedContainerCountMap.getOrDefault(currentHelixTag, 0)
-          + getMatchingRequestsCount(yarnContainerRequestBundle.getHelixTagResourceMap().get(currentHelixTag));
-      for(; requestedContainerCount < desiredContainerCount; requestedContainerCount++) {
-        requestContainer(Optional.absent(), yarnContainerRequestBundle.getHelixTagResourceMap().get(currentHelixTag));
+      allocatedContainerCountMap.putIfAbsent(currentHelixTag, new AtomicInteger(0));

Review Comment:
   The main logical change here is changing the value type from `Integer` to `AtomicInteger`. This prevents potential desync between our allocatedContainerCountMap (HelixTag -> numCountainers) and the containerMap (containerId -> Container POJO) due to race condition.
   
   I suspect the incorrect value for allocated container map is one of the reasons we saw incorrect behavior with shrink + allocating. This edge case would be generally rare and that's why we only start to see issues after a large number of days and the issue is resolved after a restart. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

Posted by GitBox <gi...@apache.org>.
homatthew commented on code in PR #3586:
URL: https://github.com/apache/gobblin/pull/3586#discussion_r999752146


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -474,11 +477,21 @@ public synchronized void requestTargetNumberOfContainers(YarnContainerRequestBun
     for (Map.Entry<String, Integer> entry : yarnContainerRequestBundle.getHelixTagContainerCountMap().entrySet()) {
       String currentHelixTag = entry.getKey();
       int desiredContainerCount = entry.getValue();
+      Resource resourceForHelixTag = yarnContainerRequestBundle.getHelixTagResourceMap().get(currentHelixTag);
+
       // Calculate requested container count based on adding allocated count and outstanding ContainerRequests in Yarn
-      int requestedContainerCount = allocatedContainerCountMap.getOrDefault(currentHelixTag, 0)
-          + getMatchingRequestsCount(yarnContainerRequestBundle.getHelixTagResourceMap().get(currentHelixTag));
-      for(; requestedContainerCount < desiredContainerCount; requestedContainerCount++) {
-        requestContainer(Optional.absent(), yarnContainerRequestBundle.getHelixTagResourceMap().get(currentHelixTag));
+      allocatedContainerCountMap.putIfAbsent(currentHelixTag, new AtomicInteger(0));

Review Comment:
   The main logical change is changing the value type from `Integer` to `AtomicInteger`. This prevents potential desync between our allocatedContainerCountMap (HelixTag -> numCountainers) and the containerMap (containerId -> Container POJO) due to race condition.
   
   I suspect the incorrect value for allocated container map is one of the reasons we saw incorrect behavior with shrink + allocating. This edge case would be generally rare and that's why we only start to see issues after a large number of days and the issue is resolved after a restart. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] ZihanLi58 merged pull request #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

Posted by GitBox <gi...@apache.org>.
ZihanLi58 merged PR #3586:
URL: https://github.com/apache/gobblin/pull/3586


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

Posted by GitBox <gi...@apache.org>.
homatthew commented on code in PR #3586:
URL: https://github.com/apache/gobblin/pull/3586#discussion_r999752146


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -474,11 +477,21 @@ public synchronized void requestTargetNumberOfContainers(YarnContainerRequestBun
     for (Map.Entry<String, Integer> entry : yarnContainerRequestBundle.getHelixTagContainerCountMap().entrySet()) {
       String currentHelixTag = entry.getKey();
       int desiredContainerCount = entry.getValue();
+      Resource resourceForHelixTag = yarnContainerRequestBundle.getHelixTagResourceMap().get(currentHelixTag);
+
       // Calculate requested container count based on adding allocated count and outstanding ContainerRequests in Yarn
-      int requestedContainerCount = allocatedContainerCountMap.getOrDefault(currentHelixTag, 0)
-          + getMatchingRequestsCount(yarnContainerRequestBundle.getHelixTagResourceMap().get(currentHelixTag));
-      for(; requestedContainerCount < desiredContainerCount; requestedContainerCount++) {
-        requestContainer(Optional.absent(), yarnContainerRequestBundle.getHelixTagResourceMap().get(currentHelixTag));
+      allocatedContainerCountMap.putIfAbsent(currentHelixTag, new AtomicInteger(0));

Review Comment:
   The main logical change here is changing the value type from `Integer` to `AtomicInteger`. This prevents potential desync between our allocatedContainerCountMap (HelixTag -> numCountainers) and the containerMap (containerId -> Container POJO) due to race condition.
   
   I suspect the incorrect value for allocated container map is one of the reasons we saw incorrect behavior during shrink + allocating. This edge case would be generally rare and that's why we only start to see issues after a large number of days and the issue is resolved after a restart. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

Posted by GitBox <gi...@apache.org>.
homatthew commented on code in PR #3586:
URL: https://github.com/apache/gobblin/pull/3586#discussion_r999761558


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -474,11 +477,21 @@ public synchronized void requestTargetNumberOfContainers(YarnContainerRequestBun
     for (Map.Entry<String, Integer> entry : yarnContainerRequestBundle.getHelixTagContainerCountMap().entrySet()) {
       String currentHelixTag = entry.getKey();
       int desiredContainerCount = entry.getValue();
+      Resource resourceForHelixTag = yarnContainerRequestBundle.getHelixTagResourceMap().get(currentHelixTag);
+
       // Calculate requested container count based on adding allocated count and outstanding ContainerRequests in Yarn
-      int requestedContainerCount = allocatedContainerCountMap.getOrDefault(currentHelixTag, 0)
-          + getMatchingRequestsCount(yarnContainerRequestBundle.getHelixTagResourceMap().get(currentHelixTag));
-      for(; requestedContainerCount < desiredContainerCount; requestedContainerCount++) {
-        requestContainer(Optional.absent(), yarnContainerRequestBundle.getHelixTagResourceMap().get(currentHelixTag));
+      allocatedContainerCountMap.putIfAbsent(currentHelixTag, new AtomicInteger(0));

Review Comment:
   Most of the rest of the changes are logging / code smell changes. Except for https://github.com/apache/gobblin/pull/3586#discussion_r999766182



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

Posted by GitBox <gi...@apache.org>.
homatthew commented on code in PR #3586:
URL: https://github.com/apache/gobblin/pull/3586#discussion_r1001069688


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -845,8 +877,6 @@ public void onContainersAllocated(List<Container> containers) {
               instanceName = null;
             }
           }
-          allocatedContainerCountMap.put(containerHelixTag,

Review Comment:
   But then again the comment above says
   ```    
   Ensure that updates to unusedHelixInstanceNames are visible to other threads that might concurrently
   invoke the callback on container allocation.
   ```
   
   Which implies there could be multiple callers to this single object. In that case, why don't we have a lock for oncomplete method which decrements this count? 
   
   Either way, the code is so painful to reason about when we have all these concurrent maps + explicit locks... Gives a sense of "safety" but still is very error prone IMO



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

Posted by GitBox <gi...@apache.org>.
homatthew commented on code in PR #3586:
URL: https://github.com/apache/gobblin/pull/3586#discussion_r1001027749


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -474,11 +477,21 @@ public synchronized void requestTargetNumberOfContainers(YarnContainerRequestBun
     for (Map.Entry<String, Integer> entry : yarnContainerRequestBundle.getHelixTagContainerCountMap().entrySet()) {
       String currentHelixTag = entry.getKey();
       int desiredContainerCount = entry.getValue();
+      Resource resourceForHelixTag = yarnContainerRequestBundle.getHelixTagResourceMap().get(currentHelixTag);
+
       // Calculate requested container count based on adding allocated count and outstanding ContainerRequests in Yarn
-      int requestedContainerCount = allocatedContainerCountMap.getOrDefault(currentHelixTag, 0)
-          + getMatchingRequestsCount(yarnContainerRequestBundle.getHelixTagResourceMap().get(currentHelixTag));
-      for(; requestedContainerCount < desiredContainerCount; requestedContainerCount++) {
-        requestContainer(Optional.absent(), yarnContainerRequestBundle.getHelixTagResourceMap().get(currentHelixTag));
+      allocatedContainerCountMap.putIfAbsent(currentHelixTag, new AtomicInteger(0));
+      int allocatedContainers = allocatedContainerCountMap.get(currentHelixTag).get();
+      int outstandingContainerRequests = getMatchingRequestsCount(resourceForHelixTag);
+      int requestedContainerCount = allocatedContainers + outstandingContainerRequests;

Review Comment:
   The main thing is this log line gives us information per helix tag. So I do want to log this information because allocatedContainers can be too general. 
   
   But I agree the log could be misleading. My original intention is that the helix tag in the front of the log line would imply these numbers are only relevant to the helix tag in question



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

Posted by GitBox <gi...@apache.org>.
homatthew commented on code in PR #3586:
URL: https://github.com/apache/gobblin/pull/3586#discussion_r1001030781


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -857,6 +887,8 @@ public void onContainersAllocated(List<Container> containers) {
 
         ContainerInfo containerInfo = new ContainerInfo(container, instanceName, containerHelixTag);
         containerMap.put(container.getId(), containerInfo);
+        allocatedContainerCountMap.putIfAbsent(containerHelixTag, new AtomicInteger(0));

Review Comment:
   Yeah it more to emulate the same behavior as getOrDefault. You're right that it's probably a no-op



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3586: [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior

Posted by GitBox <gi...@apache.org>.
homatthew commented on code in PR #3586:
URL: https://github.com/apache/gobblin/pull/3586#discussion_r1001148782


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -845,8 +877,6 @@ public void onContainersAllocated(List<Container> containers) {
               instanceName = null;
             }
           }
-          allocatedContainerCountMap.put(containerHelixTag,

Review Comment:
   Turns out it is thread safe to iterate over concurrent hash set. https://stackoverflow.com/questions/56099744/java-concurrenthashset-iterating-over-it-in-a-multi-threaded-environment
   
   And the reason for the lock / synchronized block is because if multiple threads are iterating over the same concurrent hashset and one of the threads calls remove, then since the iterator is essentially a snapshot they could each re-use the same helix instance.
   
   The sync block makes the removal process completely thread safe



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org