You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2021/02/01 05:11:59 UTC

[GitHub] [pulsar] michaeljmarshall opened a new pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

michaeljmarshall opened a new pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393


   Fixes: #6368
   
   Flaky-test: `AntiAffinityNamespaceGroupTest.testBrokerSelectionForAntiAffinityGroup`
   
   TestClass is flaky. The testMethod test method fails sporadically.
   
   See #6368 for example failures as well as for my analysis and detailed justification for this change.
   
   In short, by setting the `LoadBalancerBrokerOverloadedThresholdPercentage` to `100`, we remove the main edge case that allows two namespaces in the same anti-affinity group to get placed on the same broker.
   
   Note that I am assuming the following method will never return a value greater than 1, which could lead to test failure. https://github.com/apache/pulsar/blob/85f3ff4edbaa10c7894af8ad823cbce37b13829c/pulsar-common/src/main/java/org/apache/pulsar/policies/data/loadbalancer/LocalBrokerData.java#L214-L217


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on a change in pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
lhotari commented on a change in pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#discussion_r573545104



##########
File path: pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/GenericBrokerHostUsageImpl.java
##########
@@ -62,8 +67,10 @@ public SystemResourceUsage getBrokerHostUsage() {
     }
 
     private void checkCpuLoad() {
-        cpuUsageSum += systemBean.getSystemCpuLoad();
-        cpuUsageCount++;
+        synchronized (this) {

Review comment:
       @michaeljmarshall if you happen to make further changes, it would be useful to rebase the PR on latest master branch changes. Good work! Thank you




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] michaeljmarshall commented on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
michaeljmarshall commented on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-771394148


   @lhotari - thank you for your detailed explanation. I did not consider running this on a special environment to simulate the testing env. That is a great point, and something I'll keep in mind in the future.
   
   Based on the example logging output you provided, it looks like it is still the broker being overridden because it is considered overloaded. See the following:
   
   ```
   07:04:53.511 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO  org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - 2 brokers being considered for assignment of tenant-c8478edb-886e-43b7-8d43-12cc159c0eb9/use/ns1/0x00000000_0xffffffff
   07:04:53.511 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker http://localhost:33199 is overloaded: max usage=1.2440309524536133
   07:04:53.511 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker localhost:33199 is overloaded: CPU: 124.40309%, MEMORY: 19.42325%, DIRECT MEMORY: 2.4414062%, BANDWIDTH IN: 0.0%, BANDWIDTH OUT: 0.0%
   07:04:53.516 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO  org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - 1 brokers being considered for assignment of tenant-c8478edb-886e-43b7-8d43-12cc159c0eb9/use/ns2/0x00000000_0xffffffff
   07:04:53.516 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker http://localhost:33199 is overloaded: max usage=1.2440309524536133
   07:04:53.516 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker localhost:33199 is overloaded: CPU: 124.40309%, MEMORY: 19.42325%, DIRECT MEMORY: 2.4414062%, BANDWIDTH IN: 0.0%, BANDWIDTH OUT: 0.0%
   07:04:53.517 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker http://localhost:33199 is overloaded: max usage=1.2440309524536133
   07:04:53.517 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker localhost:33199 is overloaded: CPU: 124.40309%, MEMORY: 19.42325%, DIRECT MEMORY: 2.4414062%, BANDWIDTH IN: 0.0%, BANDWIDTH OUT: 0.0%
   ```
   
   Given the first log line, it does look like two brokers are considered for placement. However, the CPU is listed at 124%, which leads to an override. I mentioned a concern about this in my initial PR message:
   
   > Note that I am assuming the following method will never return a value greater than 1, which could lead to test failure.
   
   Perhaps it is worth bumping the limit to something like 300 in this one case? If we only have 2 cores, we won't exceed that. Although, that does leave us with the potential to see this flakiness again if we ever give the test more cores.
   
   Do you think the broker's cpu utilization is high because they are still in the process of starting up? If so, perhaps your suggested `await` command could help by giving the brokers time to stabilize.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari edited a comment on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
lhotari edited a comment on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-773108715


   @michaeljmarshall unfortunately, it remains flaky. The logs are at https://gist.github.com/lhotari/f75e3efaf5f7bf7536e6d98c1898789e . 
   
   I reproduced with these commands on Linux (Ubuntu / PopOS 20.10):
   ```
   $ git branch -D issue-6368-flaky-test # delete earlier branch for #9393 since it won't fast-forward
   $ gh pr checkout 9393
   $ mvn clean install -DskipTests -Dspotbugs.skip=true
   $ ptbx_until_test_fails_in_docker_with_logs -Pcore-modules -pl pulsar-broker -Dtest=AntiAffinityNamespaceGroupTest
   ```
   `ptbx_until_test_fails_in_docker_with_logs` is part of [the toolbox](https://github.com/lhotari/pulsar-contributor-toolbox#ptbx_until_test_fails_in_docker_with_logs) I am using. That function runs a mvn test in a loop with limited resource (`docker run --cpus=2 --memory=6g`) until the command fails.
   
   just in case anyone is interested, the ptbx_until_test_fails_in_docker_with_logs captures the logs which contain ansi color control codes, I stripped the ansi codes with ansi2txt before uploading the logs to [the gist](https://gist.github.com/lhotari/f75e3efaf5f7bf7536e6d98c1898789e):
   ```
   $ cat output_2021-02-04-093600.log | ansi2txt > output_2021-02-04-093600.log.txt
   $ gh gist create output_2021-02-04-093600.log.txt -d "testing PR 9393 https://github.com/apache/pulsar/commits/$(git rev-parse HEAD | colrm 9)"
   ```
   `gh` is the github's command line tooling, https://cli.github.com/ .


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari edited a comment on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
lhotari edited a comment on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-775735510


   > Let me know if you found anything regarding this issue. My most recent commit should definitely add some stability to the `LinuxBrokerHostUsageImpl` by not calling it at nearly the same time.
   
   @michaeljmarshall yes, you are right about the race condition. Good catch. It's the `calculateBrokerHostUsage` call that is scheduled and the `calculateBrokerHostUsage` call in the constructor that cause the issue. 
   
   Besides the race, it also seems that this is a clear bug in the `calculateBrokerHostUsage` method itself. I'm now running the test loop (to see if it fails at some point) with these changes:
   https://github.com/apache/pulsar/commit/468550f471e3e1b4e5071fa9e5fac2f69713a398
   
   I have a few commits in my fork where I experimented with various improvements:
   https://github.com/apache/pulsar/compare/master...lhotari:issue-6368-flaky-test


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari edited a comment on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
lhotari edited a comment on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-773108715


   @michaeljmarshall unfortunately, it remains flaky. The logs are at https://gist.github.com/lhotari/f75e3efaf5f7bf7536e6d98c1898789e . 
   
   I reproduced with these commands on Linux (Ubuntu / PopOS 20.10):
   ```
   $ git branch -D issue-6368-flaky-test # delete earlier branch for #9393 since it won't fast-forward
   $ gh pr checkout 9393
   $ mvn clean install -DskipTests -Dspotbugs.skip=true
   $ ptbx_until_test_fails_in_docker_with_logs -Pcore-modules -pl pulsar-broker -Dtest=AntiAffinityNamespaceGroupTest
   ```
   `ptbx_until_test_fails_in_docker_with_logs` is part of [the toolbox](https://github.com/lhotari/pulsar-contributor-toolbox#ptbx_until_test_fails_in_docker_with_logs) I am using. That function runs a mvn test in a loop with limited resource (`docker run --cpus=2 --memory=6g`) until the command fails.
   
   just in case anyone is interested, the ptbx_until_test_fails_in_docker_with_logs captures the logs which contain ansi color control codes, I stripped the ansi codes with ansi2txt before uploading the logs to [the gist](https://gist.github.com/lhotari/f75e3efaf5f7bf7536e6d98c1898789e):
   ```
   $ cat output_2021-02-04-093600.log | ansi2txt > output_2021-02-04-093600.log.txt
   $ gh gist create output_2021-02-04-093600.log.txt -d "testing PR 9393 https://github.com/apache/pulsar/commits/$(git rev-parse HEAD | colrm 9)"
   ```
   `gh` is the github's command line tooling, https://cli.github.com/ .


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] michaeljmarshall commented on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
michaeljmarshall commented on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-770571883


   Looks like the `ResourceUsage` class does not require `usage` to be less than `limit`. However, in looking at this classes usage, I don't see any places where the `usage` would be more than the `limit`.
   
   https://github.com/apache/pulsar/blob/73e0dbd4df36571460eeb070964ec498d1955c1d/pulsar-common/src/main/java/org/apache/pulsar/policies/data/loadbalancer/ResourceUsage.java


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] michaeljmarshall commented on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
michaeljmarshall commented on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-775287323


   I dug into this a bit more this morning, and it's interesting to point out that there are two fundamental differences between the system I am running tests on (macOS, or non linux) and @lhotari's system, which is linux. I'm guessing the `Azure VM` @lhotari references above is a linux machine, but it'd be good to confirm. This _might_ explain some of the differences we're seeing regarding reproducibility. The following conditional block determines the way a broker calculates cpu (and all other) host usage:
   
   https://github.com/apache/pulsar/blob/7c68ade453136895d19fa4723351d26256f3d624/pulsar-broker/src/main/java/org/apache/pulsar/broker/tools/LoadReportCommand.java#L94-L102
   
   I plan to take a look at the `LinuxBrokerHostUsageImpl` class later. I wonder if there is a bug in that code because we saw that `Infinity` value for CPU usage in the logs above. I'm guessing we're dividing by 0 in some rare edge case.
   
   There could certainly be an issue in the `GenericBrokerHostUsageImpl` class, too, but it seems to me that we're more consistently reproducing the problem in the linux version, so that's where I'll look at the linux class next.
   
   It's worth mentioning that neither of these Usage classes are explicitly tested.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on a change in pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
lhotari commented on a change in pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#discussion_r573457420



##########
File path: pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/LinuxBrokerHostUsageImpl.java
##########
@@ -75,18 +75,17 @@ public LinuxBrokerHostUsageImpl(int hostUsageCheckIntervalMin,
         this.lastCollection = 0L;
         this.usage = new SystemResourceUsage();
         this.overrideBrokerNicSpeedGbps = overrideBrokerNicSpeedGbps;
-        executorService.scheduleAtFixedRate(this::calculateBrokerHostUsage, 0,
-                hostUsageCheckIntervalMin, TimeUnit.MINUTES);
 
         boolean isCGroupsEnabled = false;
         try {
              isCGroupsEnabled = Files.exists(Paths.get(CGROUPS_CPU_USAGE_PATH));
         } catch (Exception e) {
             log.warn("Failed to check cgroup CPU usage file: {}", e.getMessage());
         }
-
         this.isCGroupsEnabled = isCGroupsEnabled;
-        calculateBrokerHostUsage();

Review comment:
       > I'm wondering if we should add logic to synchronize on one of the variables? If you have another suggestion, happy to look into it.
   
   Good catch. Yes, using `synchronized` is fine there. 
   
   In `GenericBrokerHostUsageImpl`, it could be worth setting `initialDelay` to the the scheduling period and call the methods in the constructor thread once to ensure that initialization happens in the constructor.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on a change in pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
lhotari commented on a change in pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#discussion_r573529103



##########
File path: pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/LinuxBrokerHostUsageImpl.java
##########
@@ -75,18 +75,17 @@ public LinuxBrokerHostUsageImpl(int hostUsageCheckIntervalMin,
         this.lastCollection = 0L;
         this.usage = new SystemResourceUsage();
         this.overrideBrokerNicSpeedGbps = overrideBrokerNicSpeedGbps;
-        executorService.scheduleAtFixedRate(this::calculateBrokerHostUsage, 0,
-                hostUsageCheckIntervalMin, TimeUnit.MINUTES);
 
         boolean isCGroupsEnabled = false;
         try {
              isCGroupsEnabled = Files.exists(Paths.get(CGROUPS_CPU_USAGE_PATH));
         } catch (Exception e) {
             log.warn("Failed to check cgroup CPU usage file: {}", e.getMessage());
         }
-
         this.isCGroupsEnabled = isCGroupsEnabled;
-        calculateBrokerHostUsage();

Review comment:
       thanks @michaeljmarshall . LGTM. I'll run the tests I have just to verify that the flakiness is fixed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] sijie merged pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
sijie merged pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] michaeljmarshall commented on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
michaeljmarshall commented on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-780312703


   @sijie, @eolivelli, @lhotari - given the discussion in today's community meeting about what branches should have LTS, I'm thinking this commit should be cherry picked to other release branches because the changes to the underlying host usage classes are concurrency bugs. (Even if we don't have true LTS versions, this patch should likely make it into 2.7.x and possibly 2.6.x)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] michaeljmarshall commented on a change in pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
michaeljmarshall commented on a change in pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#discussion_r573855472



##########
File path: pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/GenericBrokerHostUsageImpl.java
##########
@@ -62,8 +67,10 @@ public SystemResourceUsage getBrokerHostUsage() {
     }
 
     private void checkCpuLoad() {
-        cpuUsageSum += systemBean.getSystemCpuLoad();
-        cpuUsageCount++;
+        synchronized (this) {

Review comment:
       Done.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on a change in pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
lhotari commented on a change in pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#discussion_r572656526



##########
File path: pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/LinuxBrokerHostUsageImpl.java
##########
@@ -75,18 +75,17 @@ public LinuxBrokerHostUsageImpl(int hostUsageCheckIntervalMin,
         this.lastCollection = 0L;
         this.usage = new SystemResourceUsage();
         this.overrideBrokerNicSpeedGbps = overrideBrokerNicSpeedGbps;
-        executorService.scheduleAtFixedRate(this::calculateBrokerHostUsage, 0,
-                hostUsageCheckIntervalMin, TimeUnit.MINUTES);
 
         boolean isCGroupsEnabled = false;
         try {
              isCGroupsEnabled = Files.exists(Paths.get(CGROUPS_CPU_USAGE_PATH));
         } catch (Exception e) {
             log.warn("Failed to check cgroup CPU usage file: {}", e.getMessage());
         }
-
         this.isCGroupsEnabled = isCGroupsEnabled;
-        calculateBrokerHostUsage();

Review comment:
       I don't think that the call to `calculateBrokerHostUsage` should be removed. It should remain so that the values are available after the constructor returns. Instead of removing this line, the scheduling should happen after this call and the initial delay of the scheduling should be set to `hostUsageCheckIntervalMin` to prevent the race. 
   
   The problem in the race should also be fixed, I have created a few commits in https://github.com/apache/pulsar/compare/master...lhotari:issue-6368-flaky-test to fix the issue that the concurrent calls to `calculateBrokerHostUsage` caused.
   
   Some suggested improvements to de-risk actual production flakiness:
   * [Make ResourceUsage class immutable](https://github.com/lhotari/pulsar/commit/7c5ea55fffd49a2b09b53ca0b1132eef0b1e67cf)
   * [Replace usage of Files.readAllBytes with Files.copy](https://github.com/lhotari/pulsar/commit/97a3fb609b6aea81327146d2296412c5599988b2)
   * [Pre-calculate the usage percentage in ResourceUsage class](https://github.com/lhotari/pulsar/commit/32aec9de38674f959d090305cb40b14f8c994df7)
   * [Fix initial cpu usage calculation and use System.nanoTime() for calculating elapsed time](https://github.com/apache/pulsar/commit/468550f471e3e1b4e5071fa9e5fac2f69713a398)
   
   One possibility is to cherry-pick these commits to this same PR.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-773108715


   @michaeljmarshall unfortunately, it remains flaky. The logs are at https://gist.github.com/lhotari/f75e3efaf5f7bf7536e6d98c1898789e . 
   
   I reproduced with these commands on Linux (Ubuntu / PopOS 20.10):
   ```
   $ git branch -D issue-6368-flaky-test
   $ gh pr checkout 9393
   $ mvn clean install -DskipTests -Dspotbugs.skip=true
   $ ptbx_until_test_fails_in_docker_with_logs -Pcore-modules -pl pulsar-broker -Dtest=AntiAffinityNamespaceGroupTest
   ```
   `ptbx_until_test_fails_in_docker_with_logs` is part of [the toolbox](https://github.com/lhotari/pulsar-contributor-toolbox#ptbx_until_test_fails_in_docker_with_logs) I am using. That function runs a mvn test in a loop with limited resource (`docker run --cpus=2 --memory=6g`) until the command fails.
   
   just in case anyone is interested, the ptbx_until_test_fails_in_docker_with_logs captures the logs which contain ansi color control codes, I stripped the ansi codes with ansi2txt before uploading the logs to [the gist](https://gist.github.com/lhotari/f75e3efaf5f7bf7536e6d98c1898789e):
   ```
   $ cat output_2021-02-04-093600.log | ansi2txt > output_2021-02-04-093600.log.txt
   $ gh gist create output_2021-02-04-093600.log.txt -d "testing PR 9393 https://github.com/apache/pulsar/commits/$(git rev-parse HEAD | colrm 9)"
   ```
   `gh` is the github's command line tooling, https://cli.github.com/ .


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on a change in pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
lhotari commented on a change in pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#discussion_r572656988



##########
File path: pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/LinuxBrokerHostUsageImpl.java
##########
@@ -75,18 +75,17 @@ public LinuxBrokerHostUsageImpl(int hostUsageCheckIntervalMin,
         this.lastCollection = 0L;
         this.usage = new SystemResourceUsage();
         this.overrideBrokerNicSpeedGbps = overrideBrokerNicSpeedGbps;
-        executorService.scheduleAtFixedRate(this::calculateBrokerHostUsage, 0,
-                hostUsageCheckIntervalMin, TimeUnit.MINUTES);
 
         boolean isCGroupsEnabled = false;
         try {
              isCGroupsEnabled = Files.exists(Paths.get(CGROUPS_CPU_USAGE_PATH));
         } catch (Exception e) {
             log.warn("Failed to check cgroup CPU usage file: {}", e.getMessage());
         }
-
         this.isCGroupsEnabled = isCGroupsEnabled;
-        calculateBrokerHostUsage();
+
+        executorService.scheduleAtFixedRate(this::calculateBrokerHostUsage, 0,

Review comment:
       change `initialDelay` parameter from `0` to `hostUsageCheckIntervalMin`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] michaeljmarshall commented on a change in pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
michaeljmarshall commented on a change in pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#discussion_r572629004



##########
File path: pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/LinuxBrokerHostUsageImpl.java
##########
@@ -75,18 +75,17 @@ public LinuxBrokerHostUsageImpl(int hostUsageCheckIntervalMin,
         this.lastCollection = 0L;
         this.usage = new SystemResourceUsage();
         this.overrideBrokerNicSpeedGbps = overrideBrokerNicSpeedGbps;
-        executorService.scheduleAtFixedRate(this::calculateBrokerHostUsage, 0,
-                hostUsageCheckIntervalMin, TimeUnit.MINUTES);
 
         boolean isCGroupsEnabled = false;
         try {
              isCGroupsEnabled = Files.exists(Paths.get(CGROUPS_CPU_USAGE_PATH));
         } catch (Exception e) {
             log.warn("Failed to check cgroup CPU usage file: {}", e.getMessage());
         }
-
         this.isCGroupsEnabled = isCGroupsEnabled;
-        calculateBrokerHostUsage();
+
+        executorService.scheduleAtFixedRate(this::calculateBrokerHostUsage, 0,

Review comment:
       @lhotari - because we were scheduling `calculateBrokerHostUsage` with a delay of `0` minutes and then calling it only a few lines later, it is quite possible that there could have been a race internally that led to a calculated `elapsedSeconds` of `0`. If `elapsedSeconds` is `0`, we'll divide by by `0` in the method to calculate `cpuUsage`, which could likely be our culprit for the "Infinite" cpu usage you saw in your test.
   
   Here is the full method:
   
   https://github.com/apache/pulsar/blob/ce14ec4d0490cdad8fcc1146bb4d4a1b6cbd5979/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/LinuxBrokerHostUsageImpl.java#L96-L127
   
   Also, note that this method likely should not have been called before setting `isCGroupsEnabled` because the internals of the method rely on that value. Perhaps @merlimat can provide a bit more context for whether or not this update adds value here (he added the `isCGroupsEnabled` code this past July).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] michaeljmarshall commented on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
michaeljmarshall commented on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-771394148






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-773092792


   Thanks for the great effort on fixing this @michaeljmarshall . I'll test these changes with my setup to see if the earlier problem that I was able to reproduce is fixed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] michaeljmarshall commented on a change in pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
michaeljmarshall commented on a change in pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#discussion_r573437508



##########
File path: pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/LinuxBrokerHostUsageImpl.java
##########
@@ -75,18 +75,17 @@ public LinuxBrokerHostUsageImpl(int hostUsageCheckIntervalMin,
         this.lastCollection = 0L;
         this.usage = new SystemResourceUsage();
         this.overrideBrokerNicSpeedGbps = overrideBrokerNicSpeedGbps;
-        executorService.scheduleAtFixedRate(this::calculateBrokerHostUsage, 0,
-                hostUsageCheckIntervalMin, TimeUnit.MINUTES);
 
         boolean isCGroupsEnabled = false;
         try {
              isCGroupsEnabled = Files.exists(Paths.get(CGROUPS_CPU_USAGE_PATH));
         } catch (Exception e) {
             log.warn("Failed to check cgroup CPU usage file: {}", e.getMessage());
         }
-
         this.isCGroupsEnabled = isCGroupsEnabled;
-        calculateBrokerHostUsage();

Review comment:
       Great point. I agree that we shouldn't return until the class is fully initialized. I'm guessing that was the original intent of including that method call there, too. I can update the constructor as you suggested, and I think it likely makes the most sense to cherry-pick your commits.
   
   I also noticed that there is a race condition in the `GenericBrokerHostUsageImpl`, where two different scheduled methods make calls that update `cpuUsageSum` and `cpuUsageCount`. I'm wondering if we should add logic to `synchronize` on one of the variables? If you have another suggestion, happy to look into it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] eolivelli commented on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
eolivelli commented on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-776534967


   @codelipenghui @jiazhai @wolfstudy PTAL


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-771401543


   > Perhaps it is worth bumping the limit to something like 300 in this one case? If we only have 2 cores, we won't exceed that. Although, that does leave us with the potential to see this flakiness again if we ever give the test more cores.
   
   I guess we would have to experiment to find a solution that makes sense. I don't know this area of Pulsar so I could give a direct advice. 
   
   
   > Do you think the broker's cpu utilization is high because they are still in the process of starting up? If so, perhaps your suggested `await` command could help by giving the brokers time to stabilize.
   
   My assumption was simply that the test might start executing before both brokers are available. That has been a source of flakiness at least in DiscoveryServiceTest. I didn't confirm this assumption in any way. A simple approach would be to experiment and check whether the problem reproduces after making changes. Being able to experiment, would require having ways to reproduce the issue in environment where you can quickly run experiments. 
   
   btw. I have published my "toolbox" as open source in https://github.com/lhotari/pulsar-contributor-toolbox . That contains shell script functions that I use for reproducing Pulsar flaky test failures. It works for me when I use Linux, zsh & sdkman (for JDK installation).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] michaeljmarshall commented on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
michaeljmarshall commented on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-771406448


   @lhotari - thanks for the insight. I'll test this locally in a VM (I'm running on a Mac) and see how bumping the overloaded threshold affects the test.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] michaeljmarshall commented on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
michaeljmarshall commented on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-773459986


   @lhotari - for what it's worth, it looks like the failure is related to the cpu usage again, although this time it's not necessarily a value I would have expected--Infinity.
   
   ```text
   07:39:57.219 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO  org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - 2 brokers being considered for assignment of tenant-010508de-9e58-4c86-95d8-d9b28f60a7ec/use/ns1/0x00000000_0xffffffff
   07:39:57.219 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker http://localhost:42477 is overloaded: max usage=Infinity
   07:39:57.220 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker localhost:42477 is overloaded: CPU: Infinity%, MEMORY: 17.458387%, DIRECT MEMORY: 2.4414062%, BANDWIDTH IN: NaN%, BANDWIDTH OUT: NaN%
   07:39:57.224 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO  org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - 1 brokers being considered for assignment of tenant-010508de-9e58-4c86-95d8-d9b28f60a7ec/use/ns2/0x00000000_0xffffffff
   07:39:57.224 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker http://localhost:42477 is overloaded: max usage=Infinity
   07:39:57.224 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker localhost:42477 is overloaded: CPU: Infinity%, MEMORY: 17.458387%, DIRECT MEMORY: 2.4414062%, BANDWIDTH IN: NaN%, BANDWIDTH OUT: NaN%
   07:39:57.225 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker http://localhost:42477 is overloaded: max usage=Infinity
   07:39:57.225 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker localhost:42477 is overloaded: CPU: Infinity%, MEMORY: 17.458387%, DIRECT MEMORY: 2.4414062%, BANDWIDTH IN: NaN%, BANDWIDTH OUT: NaN%
   !!!!!!!!! FAILURE-- [TestClass name=class org.apache.pulsar.broker.loadbalance.AntiAffinityNamespaceGroupTest].testBrokerSelectionForAntiAffinityGroup([])-------
   ```
   
   I'm not positive what solutions make sense here, but it does seem like a surprising behavior that the logic calculated the CPU usage at `Infinity%`. Seems like something worth looking into. Ultimately, it's the same logic for overloading that is the problem here.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-775735510


   > Let me know if you found anything regarding this issue. My most recent commit should definitely add some stability to the `LinuxBrokerHostUsageImpl` by not calling it at nearly the same time.
   
   @michaeljmarshall yes, you are right about the race condition. Good catch. It's the `calculateBrokerHostUsage` call that is scheduled and the `calculateBrokerHostUsage` call in the constructor that cause the issue. 
   
   Besides the race, it also seems that this is a clear bug in the `calculateBrokerHostUsage` method itself. I'm now running the test loop (to see if it fails at some point) with these changes:
   https://github.com/apache/pulsar/commit/468550f471e3e1b4e5071fa9e5fac2f69713a398
   
   I have a few commits in my fork where I experimented with various improvements:
   https://github.com/apache/pulsar/compare/master...lhotari:issue-6368-flaky-test
   
   -Lari


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-770633183


   Nice work on this @michaeljmarshall . 
   
   However, it seems that the flakiness remains after this change.
   
   Sometimes it's hard to reproduce the flaky test failures locally. One thing that seems to be a common theme is that the flaky test failures happen in CI, but can be hard to produce in local environments. While working on the fix for flaky test MessageIdTest, I found a way to reproduce some flaky test failures effectively by limiting the CPU resources to somewhat similar that there is in CI. The CI tests run on an Azure VM that has 2 cores and about 6GB free RAM (IIRC). 
   
   Since I use Linux for development, the easiest way for me to constraint the resources of the test run was to use Docker.
   
   These are the commands I used to test this change:
   ```
   $ gh pr checkout 9393
   $ mvn clean install -DskipTests -Dspotbugs.skip=true
   $ docker run --cpus=2 --memory=6g -u 1000:1000 --net=host -it --rm -v $HOME:$HOME -w $PWD -v /etc/passwd:/etc/passwd:ro ubuntu bash -c 'source "$HOME/.sdkman/bin/sdkman-init.sh"; counter=0; while mvn -Pcore-modules -pl pulsar-broker test -DfailIfNoTests=false -Dtest=AntiAffinityNamespaceGroupTest -DredirectTestOutputToFile=false -DtestRetryCount=0; do echo "----------- LOOP $counter ---------------"; ((counter++)); done; echo "LOOP $counter"' | tee docker_output_`date +%s`.log
   ```
   
   here's the output: https://gist.github.com/lhotari/7d8c7ae0a9e1a26d92599c585ba64e13
   and pulsar-broker/target/surefire-reports/testng-results.xml  https://gist.github.com/lhotari/fbbcd1405d8f16be7106f8ece9f66084
   
   ```
   java.lang.AssertionError: did not expect [localhost:42919] but found [localhost:42919]
   at org.testng.Assert.fail(Assert.java:99)
   at org.testng.Assert.failEquals(Assert.java:1041)
   at org.testng.Assert.assertNotEqualsImpl(Assert.java:147)
   at org.testng.Assert.assertNotEquals(Assert.java:1531)
   at org.testng.Assert.assertNotEquals(Assert.java:1535)
   at org.apache.pulsar.broker.loadbalance.AntiAffinityNamespaceGroupTest.testBrokerSelectionForAntiAffinityGroup(AntiAffinityNamespaceGroupTest.java:427)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:498)
   at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:132)
   at org.testng.internal.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:45)
   at org.testng.internal.InvokeMethodRunnable.call(InvokeMethodRunnable.java:73)
   at org.testng.internal.InvokeMethodRunnable.call(InvokeMethodRunnable.java:11)
   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   at java.lang.Thread.run(Thread.java:748)
   ```
   
   What comes into mind is that the test might start before both brokers are available. 
   Something like https://github.com/apache/pulsar/blob/24f759c677bfe7cbb2228cab8a38f2ebd0893945/pulsar-discovery-service/src/test/java/org/apache/pulsar/discovery/service/DiscoveryServiceTest.java#L251-L252 could help with that?
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] michaeljmarshall commented on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
michaeljmarshall commented on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-775720402


   > One detail that immediately caught my eye in LinuxBrokerHostUsageImpl is the usage of System.currentTimeMillis(). Replacing that with System.nanoTime() could be helpful. The calculations that mix doubles and longs seem error prone.
   
   In looking at the class and the scheduling of the `calculateBrokerHostUsage` method, I can see that this class (and `GenericBrokerHostUsageImpl `) are not really designed to be called in immediate succession. (Side note, that makes me wonder about the choice to use `scheduleAtFixedRate` instead of `scheduleWithFixedDelay`.) Because we don't intend to call the methods in immediate succession, millis look like they could be fine, but I could definitely see making a switch to nanos too. What do you think?
   
   I also agree that the mix of longs and doubles seems error prone.
   
   > It would be useful to check how good the test coverage for this class is.
   
   These classes do seem pretty fundamental to load balancing and I believe they are also essential to the load shedding mechanisms, which are pretty important. There are tests that implicitly cover this code, but perhaps we can add something more explicit in another PR. It's likely that these tests aren't covered because some of the logic is hard to test. Reworking some of the classes to make them more testable might be necessary.
   
   > The code in LinuxBrokerHostUsageImpl rings some bells. I faced some issues earlier where the method to read the Linux /proc and /sys/fs files with Files.readAllBytes was causing odd issues. I'll have to check if I reported an issue back then or brought it up on Pulsar Slack etc.
   
   Let me know if you found anything regarding this issue. My most recent commit should definitely add some stability to the `LinuxBrokerHostUsageImpl` by not calling it at nearly the same time.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari edited a comment on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
lhotari edited a comment on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-770633183


   Nice work on this @michaeljmarshall . 
   
   However, it seems that the flakiness remains after this change.
   
   Sometimes it's hard to reproduce the flaky test failures locally. One thing that seems to be a common theme is that the flaky test failures happen in CI, but can be hard to produce in local environments. While working on the fix for flaky test MessageIdTest, I found a way to reproduce some flaky test failures effectively by limiting the CPU resources to somewhat similar that there is in CI. The CI tests run on an Azure VM that has 2 cores and about 6GB free RAM (IIRC). 
   
   Since I use Linux for development, the easiest way for me to constraint the resources of the test run was to use Docker. On other than Linux, VM tooling such as https://multipass.run/ could be helpful in creating an environment with limited cpu resources.
   
   These are the commands I used to test this change:
   ```
   $ gh pr checkout 9393
   $ mvn clean install -DskipTests -Dspotbugs.skip=true
   $ docker run --cpus=2 --memory=6g -u 1000:1000 --net=host -it --rm -v $HOME:$HOME -w $PWD -v /etc/passwd:/etc/passwd:ro ubuntu bash -c 'source "$HOME/.sdkman/bin/sdkman-init.sh"; counter=0; while mvn -Pcore-modules -pl pulsar-broker test -DfailIfNoTests=false -Dtest=AntiAffinityNamespaceGroupTest -DredirectTestOutputToFile=false -DtestRetryCount=0; do echo "----------- LOOP $counter ---------------"; ((counter++)); done; echo "LOOP $counter"' | tee docker_output_`date +%s`.log
   ```
   
   here's the output: https://gist.github.com/lhotari/7d8c7ae0a9e1a26d92599c585ba64e13
   and pulsar-broker/target/surefire-reports/testng-results.xml  https://gist.github.com/lhotari/fbbcd1405d8f16be7106f8ece9f66084
   
   ```
   java.lang.AssertionError: did not expect [localhost:42919] but found [localhost:42919]
   at org.testng.Assert.fail(Assert.java:99)
   at org.testng.Assert.failEquals(Assert.java:1041)
   at org.testng.Assert.assertNotEqualsImpl(Assert.java:147)
   at org.testng.Assert.assertNotEquals(Assert.java:1531)
   at org.testng.Assert.assertNotEquals(Assert.java:1535)
   at org.apache.pulsar.broker.loadbalance.AntiAffinityNamespaceGroupTest.testBrokerSelectionForAntiAffinityGroup(AntiAffinityNamespaceGroupTest.java:427)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:498)
   at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:132)
   at org.testng.internal.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:45)
   at org.testng.internal.InvokeMethodRunnable.call(InvokeMethodRunnable.java:73)
   at org.testng.internal.InvokeMethodRunnable.call(InvokeMethodRunnable.java:11)
   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   at java.lang.Thread.run(Thread.java:748)
   ```
   
   What comes into mind is that the test might start before both brokers are available. 
   Something like https://github.com/apache/pulsar/blob/24f759c677bfe7cbb2228cab8a38f2ebd0893945/pulsar-discovery-service/src/test/java/org/apache/pulsar/discovery/service/DiscoveryServiceTest.java#L251-L252 could help with that?
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-771401543






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] michaeljmarshall commented on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
michaeljmarshall commented on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-772303414


   @lhotari - I worked trying to get my environment set up tonight (just testing out setting the cpu/memory limits on my mac's docker desktop app, no extra VM yet), but I wasn't able to get to a place where I could test out this code change. I submitted a new commit with an updated threshold, which I think should work. If it doesn't work, I can continue tomorrow evening.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] michaeljmarshall edited a comment on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
michaeljmarshall edited a comment on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-773459986


   @lhotari - thank you for running that for me. For what it's worth, it looks like the failure is related to the cpu usage again, although this time it's not necessarily a value I would have expected--Infinity.
   
   ```text
   07:39:57.219 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO  org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - 2 brokers being considered for assignment of tenant-010508de-9e58-4c86-95d8-d9b28f60a7ec/use/ns1/0x00000000_0xffffffff
   07:39:57.219 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker http://localhost:42477 is overloaded: max usage=Infinity
   07:39:57.220 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker localhost:42477 is overloaded: CPU: Infinity%, MEMORY: 17.458387%, DIRECT MEMORY: 2.4414062%, BANDWIDTH IN: NaN%, BANDWIDTH OUT: NaN%
   07:39:57.224 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO  org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - 1 brokers being considered for assignment of tenant-010508de-9e58-4c86-95d8-d9b28f60a7ec/use/ns2/0x00000000_0xffffffff
   07:39:57.224 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker http://localhost:42477 is overloaded: max usage=Infinity
   07:39:57.224 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker localhost:42477 is overloaded: CPU: Infinity%, MEMORY: 17.458387%, DIRECT MEMORY: 2.4414062%, BANDWIDTH IN: NaN%, BANDWIDTH OUT: NaN%
   07:39:57.225 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker http://localhost:42477 is overloaded: max usage=Infinity
   07:39:57.225 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker localhost:42477 is overloaded: CPU: Infinity%, MEMORY: 17.458387%, DIRECT MEMORY: 2.4414062%, BANDWIDTH IN: NaN%, BANDWIDTH OUT: NaN%
   !!!!!!!!! FAILURE-- [TestClass name=class org.apache.pulsar.broker.loadbalance.AntiAffinityNamespaceGroupTest].testBrokerSelectionForAntiAffinityGroup([])-------
   ```
   
   I'm not positive what solutions make sense here, but it does seem like a surprising behavior that the logic calculated the CPU usage at `Infinity%`. Seems like something worth looking into. Ultimately, it's the same logic for overloading that is the problem here.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] michaeljmarshall commented on a change in pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
michaeljmarshall commented on a change in pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#discussion_r572653344



##########
File path: pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/LinuxBrokerHostUsageImpl.java
##########
@@ -75,18 +75,17 @@ public LinuxBrokerHostUsageImpl(int hostUsageCheckIntervalMin,
         this.lastCollection = 0L;
         this.usage = new SystemResourceUsage();
         this.overrideBrokerNicSpeedGbps = overrideBrokerNicSpeedGbps;
-        executorService.scheduleAtFixedRate(this::calculateBrokerHostUsage, 0,
-                hostUsageCheckIntervalMin, TimeUnit.MINUTES);
 
         boolean isCGroupsEnabled = false;
         try {
              isCGroupsEnabled = Files.exists(Paths.get(CGROUPS_CPU_USAGE_PATH));
         } catch (Exception e) {
             log.warn("Failed to check cgroup CPU usage file: {}", e.getMessage());
         }
-
         this.isCGroupsEnabled = isCGroupsEnabled;
-        calculateBrokerHostUsage();
+
+        executorService.scheduleAtFixedRate(this::calculateBrokerHostUsage, 0,

Review comment:
       Thanks for taking a look--it's good to know we're narrowing in on something. I'll take a look again tomorrow.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-777022351


   @sijie @merlimat PTAL. This is a fix to a common flaky test.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] michaeljmarshall edited a comment on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
michaeljmarshall edited a comment on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-770571883


   Looks like the `ResourceUsage` class does not require `usage` to be less than `limit`. However, in looking at this class's usage, I don't see any places where the `usage` would be more than the `limit`.
   
   https://github.com/apache/pulsar/blob/73e0dbd4df36571460eeb070964ec498d1955c1d/pulsar-common/src/main/java/org/apache/pulsar/policies/data/loadbalancer/ResourceUsage.java#L27-L68


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on a change in pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
lhotari commented on a change in pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#discussion_r572635313



##########
File path: pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/LinuxBrokerHostUsageImpl.java
##########
@@ -75,18 +75,17 @@ public LinuxBrokerHostUsageImpl(int hostUsageCheckIntervalMin,
         this.lastCollection = 0L;
         this.usage = new SystemResourceUsage();
         this.overrideBrokerNicSpeedGbps = overrideBrokerNicSpeedGbps;
-        executorService.scheduleAtFixedRate(this::calculateBrokerHostUsage, 0,
-                hostUsageCheckIntervalMin, TimeUnit.MINUTES);
 
         boolean isCGroupsEnabled = false;
         try {
              isCGroupsEnabled = Files.exists(Paths.get(CGROUPS_CPU_USAGE_PATH));
         } catch (Exception e) {
             log.warn("Failed to check cgroup CPU usage file: {}", e.getMessage());
         }
-
         this.isCGroupsEnabled = isCGroupsEnabled;
-        calculateBrokerHostUsage();
+
+        executorService.scheduleAtFixedRate(this::calculateBrokerHostUsage, 0,

Review comment:
       @michaeljmarshall I was also assuming some race condition and I made an experiment on top of your changes where I replaced the `Files.readAllBytes` usage and made `ResourceUsage` immutable: https://github.com/apache/pulsar/compare/master...lhotari:issue-6368-flaky-test . However, the same problem still reproduces.  CPU shows as Infinity in the log lines:
   ```
   06:52:40.333 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO  org.apache.pulsar.broker.loadbala
   nce.impl.ModularLoadManagerImpl - 2 brokers being considered for assignment of tenant-98033a04-3a4d-4e5d-9203-
   3147486f2673/use/ns1/0x00000000_0xffffffff
   06:52:40.333 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbala
   nce.impl.LeastLongTermMessageRate - Broker http://localhost:35061 is overloaded: max usage=Infinity
   06:52:40.333 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbala
   nce.impl.LeastLongTermMessageRate - Broker localhost:35061 is overloaded: CPU: Infinity%, MEMORY: 16.867092%, 
   DIRECT MEMORY: 2.4597175%, BANDWIDTH IN: NaN%, BANDWIDTH OUT: NaN%
   06:52:40.337 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO  org.apache.pulsar.broker.loadbala
   nce.impl.ModularLoadManagerImpl - 1 brokers being considered for assignment of tenant-98033a04-3a4d-4e5d-9203-
   3147486f2673/use/ns2/0x00000000_0xffffffff
   06:52:40.337 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbala
   nce.impl.LeastLongTermMessageRate - Broker http://localhost:35061 is overloaded: max usage=Infinity
   06:52:40.337 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbala
   nce.impl.LeastLongTermMessageRate - Broker localhost:35061 is overloaded: CPU: Infinity%, MEMORY: 16.867092%, 
   DIRECT MEMORY: 2.4597175%, BANDWIDTH IN: NaN%, BANDWIDTH OUT: NaN%
   06:52:40.338 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbala
   nce.impl.LeastLongTermMessageRate - Broker http://localhost:35061 is overloaded: max usage=Infinity
   06:52:40.338 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbala
   nce.impl.LeastLongTermMessageRate - Broker localhost:35061 is overloaded: CPU: Infinity%, MEMORY: 16.867092%, 
   DIRECT MEMORY: 2.4597175%, BANDWIDTH IN: NaN%, BANDWIDTH OUT: NaN%
   !!!!!!!!! FAILURE-- [TestClass name=class org.apache.pulsar.broker.loadbalance.AntiAffinityNamespaceGroupTest]
   .testBrokerSelectionForAntiAffinityGroup([])-------
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-771581807


   I took a quite look at the test and most likely my suggestion about checking the broker availability in setup doesn't make sense. What you already mentioned about the overload situation could be useful to validate by modifying the code and removing the logic temporarily from ModularLoadManagerImpl. For that, you would have to be able to first consistently reproduce the issue and then experimenting will be easy.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-775386084


   The code in `LinuxBrokerHostUsageImpl` rings some bells. I faced some issues earlier where the method to read the Linux `/proc` and `/sys/fs` files with `Files.readAllBytes` was causing odd issues. I'll have to check if I reported an issue back then or brought it up on Pulsar Slack etc.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-775359872


   > I'm guessing the `Azure VM` @lhotari references above is a linux machine, but it'd be good to confirm.
   
   @michaeljmarshall yes, it's a VM running Ubuntu Linux. There's [GitHub Actions Runner VMs specs](https://docs.github.com/en/actions/reference/specifications-for-github-hosted-runners#supported-runners-and-hardware-resources) for more details about the environment. It's possible to debug such an environment by getting a shell inside a running VM with https://github.com/marketplace/actions/debugging-with-tmate in your own fork, for example. To debug the GitHub Actions VM, you'd enable GitHub Actions in your own fork in some repository and open a PR to your own fork. However the downside of that is that it's very time consuming. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] michaeljmarshall commented on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
michaeljmarshall commented on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-773090796


   @lhotari - After running this several times on my machine, I haven't been able to reproduce the failure with the limit set to 400. Although, I also wasn't able to get the broker's CPU utilization to go above 100%, but it looks like you were, so I'm not sure what that means because you had over 100% utilization. At what point do we consider this the solution to the flaky test? Thank you for your help.
   
   As an aside, it does seem that using docker to limit CPU/mem is an option on a mac. Here is my `Dockerfile`
   
   ```Dockerfile
   FROM openjdk:8-jdk-slim
   
   RUN mkdir -p /usr/share/man/man1
   
   RUN apt update && apt -y install maven
   
   ENTRYPOINT ["/bin/bash"]
   ```
   
   and then I started the container:
   
   ```shell
   docker run --cpus=2 --memory=3g -it --entrypoint bash -v /Users/Michael:/root pulsar-test-env:0.0.0
   ```
   
   and after navigating to my `pulsar` dir, ran
   
   ```shell
   mvn test -Dtest=AntiAffinityNamespaceGroupTest -DfailIfNoTests=false -pl pulsar-broker
   ```
   
   (Because I mounted my home to the container's home, maven worked properly so I could run the build and install outside of the container.)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] michaeljmarshall commented on a change in pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
michaeljmarshall commented on a change in pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#discussion_r573537932



##########
File path: pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/GenericBrokerHostUsageImpl.java
##########
@@ -62,8 +67,10 @@ public SystemResourceUsage getBrokerHostUsage() {
     }
 
     private void checkCpuLoad() {
-        cpuUsageSum += systemBean.getSystemCpuLoad();
-        cpuUsageCount++;
+        synchronized (this) {

Review comment:
       Great point. I think that makes more sense. I'll update it when I'm back online.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] michaeljmarshall edited a comment on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
michaeljmarshall edited a comment on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-770571883


   Looks like the `ResourceUsage` class does not require `usage` to be less than `limit`. However, in looking at this classes usage, I don't see any places where the `usage` would be more than the `limit`.
   
   https://github.com/apache/pulsar/blob/73e0dbd4df36571460eeb070964ec498d1955c1d/pulsar-common/src/main/java/org/apache/pulsar/policies/data/loadbalancer/ResourceUsage.java#L27-L68


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] michaeljmarshall commented on a change in pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
michaeljmarshall commented on a change in pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#discussion_r573485072



##########
File path: pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/LinuxBrokerHostUsageImpl.java
##########
@@ -75,18 +75,17 @@ public LinuxBrokerHostUsageImpl(int hostUsageCheckIntervalMin,
         this.lastCollection = 0L;
         this.usage = new SystemResourceUsage();
         this.overrideBrokerNicSpeedGbps = overrideBrokerNicSpeedGbps;
-        executorService.scheduleAtFixedRate(this::calculateBrokerHostUsage, 0,
-                hostUsageCheckIntervalMin, TimeUnit.MINUTES);
 
         boolean isCGroupsEnabled = false;
         try {
              isCGroupsEnabled = Files.exists(Paths.get(CGROUPS_CPU_USAGE_PATH));
         } catch (Exception e) {
             log.warn("Failed to check cgroup CPU usage file: {}", e.getMessage());
         }
-
         this.isCGroupsEnabled = isCGroupsEnabled;
-        calculateBrokerHostUsage();

Review comment:
       Done. Mind taking another look?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] michaeljmarshall commented on a change in pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
michaeljmarshall commented on a change in pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#discussion_r572630364



##########
File path: pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/LinuxBrokerHostUsageImpl.java
##########
@@ -75,18 +75,17 @@ public LinuxBrokerHostUsageImpl(int hostUsageCheckIntervalMin,
         this.lastCollection = 0L;
         this.usage = new SystemResourceUsage();
         this.overrideBrokerNicSpeedGbps = overrideBrokerNicSpeedGbps;
-        executorService.scheduleAtFixedRate(this::calculateBrokerHostUsage, 0,
-                hostUsageCheckIntervalMin, TimeUnit.MINUTES);
 
         boolean isCGroupsEnabled = false;
         try {
              isCGroupsEnabled = Files.exists(Paths.get(CGROUPS_CPU_USAGE_PATH));
         } catch (Exception e) {
             log.warn("Failed to check cgroup CPU usage file: {}", e.getMessage());
         }
-
         this.isCGroupsEnabled = isCGroupsEnabled;
-        calculateBrokerHostUsage();
+
+        executorService.scheduleAtFixedRate(this::calculateBrokerHostUsage, 0,

Review comment:
       Also note that any bad state, like an infinite cpu usage, would be stored for a minute until the cpu usage is recalculated. Considering that this test runs quickly, that would definitely be a culprit for our flaky test.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] michaeljmarshall edited a comment on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
michaeljmarshall edited a comment on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-773459986


   @lhotari - thank you for running that for me. For what it's worth, it looks like the failure is related to the cpu usage again, although this time it's not necessarily a value I would have expected--Infinity.
   
   ```text
   07:39:57.219 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO  org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - 2 brokers being considered for assignment of tenant-010508de-9e58-4c86-95d8-d9b28f60a7ec/use/ns1/0x00000000_0xffffffff
   07:39:57.219 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker http://localhost:42477 is overloaded: max usage=Infinity
   07:39:57.220 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker localhost:42477 is overloaded: CPU: Infinity%, MEMORY: 17.458387%, DIRECT MEMORY: 2.4414062%, BANDWIDTH IN: NaN%, BANDWIDTH OUT: NaN%
   07:39:57.224 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO  org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - 1 brokers being considered for assignment of tenant-010508de-9e58-4c86-95d8-d9b28f60a7ec/use/ns2/0x00000000_0xffffffff
   07:39:57.224 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker http://localhost:42477 is overloaded: max usage=Infinity
   07:39:57.224 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker localhost:42477 is overloaded: CPU: Infinity%, MEMORY: 17.458387%, DIRECT MEMORY: 2.4414062%, BANDWIDTH IN: NaN%, BANDWIDTH OUT: NaN%
   07:39:57.225 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker http://localhost:42477 is overloaded: max usage=Infinity
   07:39:57.225 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker localhost:42477 is overloaded: CPU: Infinity%, MEMORY: 17.458387%, DIRECT MEMORY: 2.4414062%, BANDWIDTH IN: NaN%, BANDWIDTH OUT: NaN%
   !!!!!!!!! FAILURE-- [TestClass name=class org.apache.pulsar.broker.loadbalance.AntiAffinityNamespaceGroupTest].testBrokerSelectionForAntiAffinityGroup([])-------
   ```
   
   I'm not positive what solutions make sense here, but it does seem like a surprising behavior that the logic calculated the CPU usage at `Infinity%`. Seems like something worth looking into. Ultimately, it's the same logic for overloading that is the problem here.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] eolivelli commented on a change in pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
eolivelli commented on a change in pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#discussion_r573533094



##########
File path: pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/GenericBrokerHostUsageImpl.java
##########
@@ -62,8 +67,10 @@ public SystemResourceUsage getBrokerHostUsage() {
     }
 
     private void checkCpuLoad() {
-        cpuUsageSum += systemBean.getSystemCpuLoad();
-        cpuUsageCount++;
+        synchronized (this) {

Review comment:
       nit: what about moving the `synchronized` keyword in the method declaration ?

##########
File path: pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/GenericBrokerHostUsageImpl.java
##########
@@ -85,10 +92,12 @@ private double getTotalCpuLimit() {
     }
 
     private double getTotalCpuUsage() {
-        double cpuUsage = cpuUsageSum / cpuUsageCount;
-        cpuUsageSum = 0d;
-        cpuUsageCount = 0;
-        return cpuUsage;
+        synchronized (this) {

Review comment:
       nit: what about moving the `synchronized` keyword in the method declaration ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] michaeljmarshall commented on a change in pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
michaeljmarshall commented on a change in pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#discussion_r573856593



##########
File path: pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/GenericBrokerHostUsageImpl.java
##########
@@ -85,10 +92,12 @@ private double getTotalCpuLimit() {
     }
 
     private double getTotalCpuUsage() {
-        double cpuUsage = cpuUsageSum / cpuUsageCount;
-        cpuUsageSum = 0d;
-        cpuUsageCount = 0;
-        return cpuUsage;
+        synchronized (this) {

Review comment:
       I did this, and I also moved the check for `cpuUsageCount` being zero into this method. It makes more sense to me in the synchronized method, and it will ensure that if any other methods end up calling this one, there isn't a chance of accidentally dividing by 0.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-775365369


   > I plan to take a look at the `LinuxBrokerHostUsageImpl` class later. I wonder if there is a bug in that code because we saw that `Infinity` value for CPU usage in the logs above. I'm guessing we're dividing by 0 in some rare edge case.
   
   @michaeljmarshall very good direction. One detail that immediately caught my eye in `LinuxBrokerHostUsageImpl` is the usage of `System.currentTimeMillis()`. Replacing that with `System.nanoTime()` could be helpful. The calculations that mix doubles and longs seem error prone. It would be useful to check how good the test coverage for this class is.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] michaeljmarshall commented on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
michaeljmarshall commented on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-773090796






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #9393: Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #9393:
URL: https://github.com/apache/pulsar/pull/9393#issuecomment-773092792






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org