You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2022/07/28 03:26:34 UTC

[GitHub] [pulsar] coderzc opened a new pull request, #16832: [feature][broker] Support cgroup v2 for by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

coderzc opened a new pull request, #16832:
URL: https://github.com/apache/pulsar/pull/16832

   Master Issue: #16601
   
   ### Motivation
   
   The Pulsar loadbalancer detects CPU limits using cgroup v1 API, and the `jdk.internal.platform.Metrics` already support cgroup (V1, v2) so we should use `jdk.internal.platform.Metrics` to get the cgroup metrics.
   
   ### Modifications
   
   Use `jdk.internal.platform.Metrics` to get the cgroup metrics in Pulsar Loadbalancer.
   
   ### Verifying this change
   
   - [ ] Make sure that the change passes the CI checks.
   
   This change is already covered by existing tests, such as *testCGroupMetrics*.
   
   *(example:)*
     - *Added integration tests for end-to-end deployment with large payloads (10MB)*
     - *Extended integration test for recovery after broker failure*
   
   ### Does this pull request potentially affect one of the following parts:
   
   *If `yes` was chosen, please highlight the changes*
   
     - Dependencies (does it add or upgrade a dependency): (yes / no)
     - The public API: (yes / no)
     - The schema: (yes / no / don't know)
     - The default values of configurations: (yes / no)
     - The wire protocol: (yes / no)
     - The rest endpoints: (yes / no)
     - The admin cli options: (yes / no)
     - Anything that affects deployment: (yes / no / don't know)
   
   ### Documentation
   
   Check the box below or label this PR directly.
   
   Need to update docs? 
   
   - [ ] `doc-required` 
   (Your PR needs to update docs and you will update later)
     
   - [x] `doc-not-needed` 
   (Please explain why)
     
   - [ ] `doc` 
   (Your PR contains doc changes)
   
   - [ ] `doc-complete`
   (Docs have been already added)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] coderzc commented on pull request #16832: [feature][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by GitBox <gi...@apache.org>.
coderzc commented on PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#issuecomment-1204862803

   /pulsarbot run-failure-checks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] coderzc commented on pull request #16832: [feature][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by GitBox <gi...@apache.org>.
coderzc commented on PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#issuecomment-1211489247

   /pulsarbot run-failure-checks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] github-actions[bot] commented on pull request #16832: [feature][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#issuecomment-1249979340

   The pr had no activity for 30 days, mark with Stale label.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] coderzc commented on a diff in pull request #16832: [feature][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by GitBox <gi...@apache.org>.
coderzc commented on code in PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#discussion_r940183684


##########
pulsar-broker/src/test/java/org/apache/pulsar/broker/loadbalance/SimpleBrokerStartTest.java:
##########
@@ -97,4 +101,25 @@ public void testNoNICSpeed() throws Exception {
     }
 
 
+    @Test
+    public void testCGroupMetrics() throws IOException {

Review Comment:
   Good idea, we can assert `metrics != null`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] heesung-sn commented on pull request #16832: [feature][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by GitBox <gi...@apache.org>.
heesung-sn commented on PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#issuecomment-1216854483

   
   > @heesung-sn What do you think?
   
   
   
   According to this article, 
   https://developers.redhat.com/articles/2022/04/19/java-17-whats-new-openjdks-container-awareness#recent_changes_in_openjdk_s_container_awareness_code
   
   It seems like `OperatingSystemMXBean` already provides `cpu usage in percentage relative to cgroup (for both v1 and v2)` (I will call it as `cpu_usage_in_percent_cgroup` for our discussion)
   
   I agree with you. From my understanding, load balancer cares about `cpu_usage_in_percent_cgroup` in the end for the cpu usage computation. 
   
   However, LB also requires other signals such as `memory_usage_in_percent_cgroup`, `network_usage_in_percent_cgroup`, which require separate limits for their percentage computation.
   
   As you pointed out, the problem is that LB uses the same generic code (requiring limit) to compute the resource percentage.
   
   Maybe we can tweak the code to ignore the limit and use the signal as-is if they are already in the `*_usage_in_percent_cgroup` form.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] nodece commented on pull request #16832: [feature][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by "nodece (via GitHub)" <gi...@apache.org>.
nodece commented on PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#issuecomment-1523721850

   Any updates? I think this PR makes sense to use the cgroup v2.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] heesung-sn commented on a diff in pull request #16832: [improve][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by "heesung-sn (via GitHub)" <gi...@apache.org>.
heesung-sn commented on code in PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#discussion_r1180577209


##########
pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/LinuxInfoUtils.java:
##########
@@ -98,11 +138,14 @@ public static double getTotalCpuLimit(boolean isCGroupsEnabled) {
      * Get CGroup cpu usage.
      * @return Cpu usage
      */
-    public static double getCpuUsageForCGroup() {
+    public static long getCpuUsageForCGroup() {
         try {
+            if (metrics != null && getCpuUsageMethod != null) {
+                return (long) getCpuUsageMethod.invoke(metrics);

Review Comment:
   But your getCpuUsageMethod returns a different unit(rate), bounding [0,1].
   
   Don't we need to convert this rate to the aggregated cpu usage time?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] codelipenghui commented on a diff in pull request #16832: [feature][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by GitBox <gi...@apache.org>.
codelipenghui commented on code in PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#discussion_r940068724


##########
pulsar-broker/src/test/java/org/apache/pulsar/broker/loadbalance/SimpleBrokerStartTest.java:
##########
@@ -97,4 +101,25 @@ public void testNoNICSpeed() throws Exception {
     }
 
 
+    @Test
+    public void testCGroupMetrics() throws IOException {

Review Comment:
   We should check if the test really uses the `jdk.internal.platform.Metrics` not the old way?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] asafm commented on pull request #16832: [feature][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by GitBox <gi...@apache.org>.
asafm commented on PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#issuecomment-1217658537

   >However, LB also requires other signals such as dict_memory_usage_in_percent_cgroup, network_usage_in_percent_cgroup, which require separate limits for their percentage computation.
   Can you please expand on that? I don't any method providing usage and limit of direct memory in Pulsar. I do see usage and limit for physical memory (in the container) which as we said we can take from the mxBean which provides total and usage of memory. 
   Regarding network usage - yes, that I was surprised to see that the operating system didn't include, and we resort to reading operating system specific files to obtain that. I posted a [question](https://mail.openjdk.org/pipermail/discuss/2022-August/006141.html) in the OpenJDK general discussion mailing to get the motivation for that ( I did search their JIRA issue database and also googled the mailing list site but found nothing).
   
   >Maybe we can tweak the code to ignore the limit and use the signal as-is if they are already in the *_usage_in_percent_cgroup form.
   I guess that requires a separate PR. Maybe we can push that PR in but add an issue to refactor (delete it out) the dependency on limit and only use percentage (which should be enough) or refactor to avoid using limit and be more specific like ResourceUsage.CpuUsage, MemoryUsage, NetworkUsage, each providing it's own method, where the CPU would only provide percentage.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] coderzc commented on pull request #16832: [feature][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by GitBox <gi...@apache.org>.
coderzc commented on PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#issuecomment-1198847499

   /pulsarbot run-failure-checks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] coderzc commented on a diff in pull request #16832: [feature][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by GitBox <gi...@apache.org>.
coderzc commented on code in PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#discussion_r935777436


##########
pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/LinuxInfoUtils.java:
##########
@@ -65,9 +87,14 @@ public static boolean isLinux() {
      */
     public static boolean isCGroupEnabled() {
         try {
-            return Files.exists(Paths.get(CGROUPS_CPU_USAGE_PATH));
+            if (metrics == null) {
+                return false;

Review Comment:
   Thanks, I updated it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] coderzc commented on pull request #16832: [feature][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by "coderzc (via GitHub)" <gi...@apache.org>.
coderzc commented on PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#issuecomment-1527583577

   > I guess that requires a separate PR. Maybe we can push that PR in but add an issue to refactor (delete it out) the dependency on limit and only use percentage (which should be enough) or refactor to avoid using limit and be more specific like ResourceUsage.CpuUsage, MemoryUsage, NetworkUsage, each providing it's own method, where the CPU would only provide percentage.
   
   Yes, we need a separate PR to refactor it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] asafm commented on pull request #16832: [feature][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by GitBox <gi...@apache.org>.
asafm commented on PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#issuecomment-1214681645

   I don't understand something. You're using an internal class from the JDK - it's internal - you're not supposed to use it as is no?
   How is the JDK itself using this - and maybe we can this info in a more public API way?
   Can you describe in more detail the motivation to get those metrics and why normal MBean metrics don't cut it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] asafm commented on pull request #16832: [feature][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by GitBox <gi...@apache.org>.
asafm commented on PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#issuecomment-1216226129

   Ok, I did some reading and this is what I found:
   
   1. Usage calculation in `OperatingSystemMXBean` which delegates to `OperatingSystemImp` gives you the percentage of usage of CPU relative to the limit imposed on the container group, as can be seen here:
   ```java
               long quota = containerMetrics.getCpuQuota();
               long share = containerMetrics.getCpuShares();
               if (quota > 0) {
                   long numPeriods = containerMetrics.getCpuNumPeriods();
                   long quotaNanos = TimeUnit.MICROSECONDS.toNanos(quota * numPeriods);
                   return getUsageDividesTotal(cpuUsageSupplier().getAsLong(), quotaNanos);
   ```
   
   which calls
   ```java
           private double getUsageDividesTotal(long usageTicks, long totalTicks) {
               // If cpu quota or cpu shares are in effect. Calculate the cpu load
               // based on the following formula (similar to how
               // getCpuLoad0() is being calculated):
               //
               //   | usageTicks - usageTicks' |
               //  ------------------------------
               //   | totalTicks - totalTicks' |
               //
               // where usageTicks' and totalTicks' are historical values
               // retrieved via an earlier call of this method.
               if (usageTicks < 0 || totalTicks <= 0) {
                   return -1;
               }
               long distance = usageTicks - this.usageTicks;
               this.usageTicks = usageTicks;
               long totalDistance = totalTicks - this.totalTicks;
               this.totalTicks = totalTicks;
               double systemLoad = 0.0;
               if (distance > 0 && totalDistance > 0) {
                   systemLoad = ((double)distance) / totalDistance;
               }
               // Ensure the return value is in the range 0.0 -> 1.0
               systemLoad = Math.max(0.0, systemLoad);
               systemLoad = Math.min(1.0, systemLoad);
               return systemLoad;
           }
   ```
   
   So we can obtain the percentage of CPU used relative to CPU allocated to the container (container group).
   As you can see the code pasted above also does delta calculation, relative to last time it was called. It is the same as we do as can be see in Pulsar code here:
   
   ```java
       private double getTotalCpuUsageForCGroup(double elapsedTimeSeconds) {
           double usage = getCpuUsageForCGroup();
           double currentUsage = usage - lastCpuUsage;
           lastCpuUsage = usage;
           return 100 * currentUsage / elapsedTimeSeconds / TimeUnit.SECONDS.toNanos(1);
       }
   ```
   
   The main point here is that JDK does provide relative usage or as you said "mean" value.
   
   2. The biggest difference I see is that when we report usage percent, we do so relative to the entire host: we take the CPU Usage for CGroup as reported by the operating system  (measured in microseconds), only the delta from last time measured, and divide that by elapsed, so in effect, it is CPU used relative to the entire host.
   
   ```java
       public static long getCpuUsageForCGroup() {
           try {
               if (metrics != null && getCpuUsageMethod != null) {
                   return (long) getCpuUsageMethod.invoke(metrics);
               }
   
   ```
   
   Which delegates to
   ```java
       /**
        * Returns the aggregate time, in nanoseconds, consumed by all
        * tasks in the Isolation Group.
        *
        * @return Time in nanoseconds, -1 if unknown or
        *         -2 if the metric is not supported.
        *
        */
       public long getCpuUsage();
   ```
   
   The limit we report is the percentage of CPU allocated, relative to the entire host. 
   
   It actually looked a bit odd to me: Why do we report usage relative to host and limit relative to host? Why not just report usage in percent relative to the allocated container limit?
   
   So I looked at how we are using it, as you also mentioned above. So I saw those references:
   
   ```java
               if (entry.getValue().limit > 0 && entry.getValue().limit > entry.getValue().usage) {
                   Double temp = ((entry.getValue().limit - entry.getValue().usage) / entry.getValue().limit) * 100;
                   percentAvailable = temp.intValue();
               }
   ```
   In the above code you can see it doesn't really need it the percent to be relative to host nor does it need the limit. It just needs to say: CPU usage in percent relative to container allocated CPU limit < 100%, which the JDK gives.
   
   ```java
              double cpuChange = (newUsage.cpu.limit > 0)
                               ? ((newUsage.cpu.usage - oldUsage.cpu.usage) * 100 / newUsage.cpu.limit)
                               : 0;
   ```
   
   In the above code, you can see it only needs to know how much percent has changed - same thing - it can use the JDK load percent.
   
   The biggest issue I have is only with the reporting
   ```java
       private void printResourceUsage(String name, ResourceUsage usage) {
           spec.console().println(name + " : usage = " + usage.usage + ", limit = " + usage.limit);
       }
   ```
   
   I don't really understand why would some need to know the limit. Why isn't the CPU percentage enough relative to the container allocation?
   
   I understand all I mentioned here is a bit of technical debt to fix in this PR. I'm saying that we didn't mind not reporting the limit, we didn't need to be very operating system specific, and write Linux-based reporting classes and read the info of it from the file system and now through JVM internal classes. We could just use `OperatingSystemMXBean`.
   
   
   @heesung-sn What do you think?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] nodece merged pull request #16832: [improve][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by "nodece (via GitHub)" <gi...@apache.org>.
nodece merged PR #16832:
URL: https://github.com/apache/pulsar/pull/16832


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] coderzc commented on a diff in pull request #16832: [feature][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by GitBox <gi...@apache.org>.
coderzc commented on code in PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#discussion_r943076687


##########
pulsar-broker/src/test/java/org/apache/pulsar/broker/loadbalance/SimpleBrokerStartTest.java:
##########
@@ -97,4 +101,31 @@ public void testNoNICSpeed() throws Exception {
     }
 
 
+    @Test
+    public void testCGroupMetrics() throws IllegalAccessException {
+        if (!LinuxInfoUtils.isLinux()) {
+            return;
+        }
+
+        boolean existsCGroup = Files.exists(Paths.get("/sys/fs/cgroup"));
+        boolean cGroupEnabled = LinuxInfoUtils.isCGroupEnabled();
+        Assert.assertEquals(cGroupEnabled, existsCGroup);
+
+        double totalCpuLimit = LinuxInfoUtils.getTotalCpuLimit(cGroupEnabled);
+        log.info("totalCpuLimit: {}", totalCpuLimit);
+        Assert.assertTrue(totalCpuLimit > 0.0);
+
+        if (cGroupEnabled) {
+            Assert.assertNotNull(FieldUtils.readStaticField(LinuxInfoUtils.class, "metrics", true));

Review Comment:
   Why not use reflection to get it? This is just a test.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] asafm commented on pull request #16832: [feature][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by GitBox <gi...@apache.org>.
asafm commented on PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#issuecomment-1216235590

   Please note that I haven't found any other open source that uses `jdk.internal.platform.Metrics` - I tried searching using TabNine code search and also GitHub search.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] nodece commented on a diff in pull request #16832: [feature][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by GitBox <gi...@apache.org>.
nodece commented on code in PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#discussion_r945396805


##########
pulsar-broker/src/test/java/org/apache/pulsar/broker/loadbalance/SimpleBrokerStartTest.java:
##########
@@ -97,4 +100,27 @@ public void testNoNICSpeed() throws Exception {
     }
 
 
+    @Test
+    public void testCGroupMetrics() {
+        if (!LinuxInfoUtils.isLinux()) {
+            return;
+        }
+
+        boolean existsCGroup = Files.exists(Paths.get("/sys/fs/cgroup"));
+        boolean cGroupEnabled = LinuxInfoUtils.isCGroupEnabled();
+        Assert.assertEquals(cGroupEnabled, existsCGroup);
+
+        double totalCpuLimit = LinuxInfoUtils.getTotalCpuLimit(cGroupEnabled);
+        log.info("totalCpuLimit: {}", totalCpuLimit);
+        Assert.assertTrue(totalCpuLimit > 0.0);
+
+        if (cGroupEnabled) {
+            Assert.assertNotNull(LinuxInfoUtils.metrics);

Review Comment:
   ```suggestion
               Assert.assertNotNull(LinuxInfoUtils.getMetrics());
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] coderzc commented on pull request #16832: [feature][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by GitBox <gi...@apache.org>.
coderzc commented on PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#issuecomment-1215127433

   > Can you please explain why it is different? I do see that eventually, the load balancer needs the percentage right? how much was the CPU used relative to the limit which is 100%? getCPULoad of sun implementation does try to calculate that, no?
   
   We need to record `usage` and `limit`, but `OperatingSystemMXBean#getCpuLoad` only is the percentage.
   
   And for `usage` we calculate is a mean value, more please see:
   https://github.com/apache/pulsar/blob/380c7587d0cefdd763030f348246fef711bfd58c/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/LinuxBrokerHostUsageImpl.java#L139-L144
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] lhotari commented on pull request #16832: [improve][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by "lhotari (via GitHub)" <gi...@apache.org>.
lhotari commented on PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#issuecomment-1609253371

   We need this to be included in branch-3.0 and branch-2.11 asap. I backported the changes to branch-2.10 with all dependencies in #20659.
   
   This is becoming urgent.
   
   AKS Kubernetes 1.25+ switches to use cgroup v2:
   https://github.com/Azure/AKS/releases/tag/2023-03-05
   
   [AKS Kubernetes 1.24 goes End-of-life on July 30, 2023](https://learn.microsoft.com/en-us/azure/aks/supported-kubernetes-versions?tabs=azure-cli#aks-kubernetes-release-calendar).
   
   GKE contains to have a way to select between cgroup v1 & cgroup v2:
   https://cloud.google.com/kubernetes-engine/docs/how-to/node-system-config#cgroup-mode-options
   GKE will default to cgroup v2 in new Kubernetes 1.26 clusters or node pools.
   AWS EKS v1.26 nodes will default to cgroup v2.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] heesung-sn commented on pull request #16832: [feature][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by GitBox <gi...@apache.org>.
heesung-sn commented on PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#issuecomment-1218279647

   
   > Can you please expand on that? I don't any method providing usage and limit of direct memory in Pulsar. I do see usage and limit for physical memory (in the container) which as we said we can take from the mxBean which provides total and usage of memory.
   
   I see this method is used to collect direct memory usage.
   
   https://github.com/apache/pulsar/blob/535415302ef6d1a9017f6ec25b87b24afd081155/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/LoadManagerShared.java#L228-L229
   
   > I guess that requires a separate PR. Maybe we can push that PR in but add an issue to refactor (delete it out) the dependency on limit and only use percentage (which should be enough) or refactor to avoid using limit and be more specific like ResourceUsage.CpuUsage, MemoryUsage, NetworkUsage, each providing it's own method, where the CPU would only provide percentage.
   
   Yes, a separate PR makes sense to me.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] coderzc commented on pull request #16832: [feature][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by GitBox <gi...@apache.org>.
coderzc commented on PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#issuecomment-1214721207

   > I don't understand something. You're using an internal class from the JDK - it's internal - you're not supposed to use it as is no? How is the JDK itself using this - and maybe we can this info in a more public API way? Can you describe in more detail the motivation to get those metrics and why normal MBean metrics don't cut it?
   
   The `OperatingSystemMXBean` using `jdk.internal.platform.Metrics` to get CPU load.
   But `OperatingSystemMXBean` only exports CPU load, we also need the limit of CPU unfortunately, JDK did not export it.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] heesung-sn commented on a diff in pull request #16832: [improve][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by "heesung-sn (via GitHub)" <gi...@apache.org>.
heesung-sn commented on code in PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#discussion_r1180577209


##########
pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/LinuxInfoUtils.java:
##########
@@ -98,11 +138,14 @@ public static double getTotalCpuLimit(boolean isCGroupsEnabled) {
      * Get CGroup cpu usage.
      * @return Cpu usage
      */
-    public static double getCpuUsageForCGroup() {
+    public static long getCpuUsageForCGroup() {
         try {
+            if (metrics != null && getCpuUsageMethod != null) {
+                return (long) getCpuUsageMethod.invoke(metrics);

Review Comment:
   But your getCpuUsageMethod returns a different unit(rate), bounding [0,1].
   
   Don't we need to convert this rate to the aggregated cpu usage time?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] coderzc commented on pull request #16832: [feature][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by "coderzc (via GitHub)" <gi...@apache.org>.
coderzc commented on PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#issuecomment-1524693296

   I will update this PR as soon.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] asafm commented on pull request #16832: [feature][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by GitBox <gi...@apache.org>.
asafm commented on PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#issuecomment-1214973632

   >And, for the CPU usage we have our own calculation formula, which is not exactly the same as OperatingSystemMXBean#getCpuLoad.
   Can you please explain why it is different? I do see that eventually, the load balancer needs the percentage right? how much was the CPU used relative to the limit which is 100%? getCPULoad of sun implementation does try to calculate that, no? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] nodece commented on a diff in pull request #16832: [feature][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by GitBox <gi...@apache.org>.
nodece commented on code in PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#discussion_r934177765


##########
pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/LinuxInfoUtils.java:
##########
@@ -65,9 +87,14 @@ public static boolean isLinux() {
      */
     public static boolean isCGroupEnabled() {
         try {
-            return Files.exists(Paths.get(CGROUPS_CPU_USAGE_PATH));
+            if (metrics == null) {
+                return false;

Review Comment:
   If `jdk.internal.platform.Container` is unavailable, I think we should fall back to `CGROUPS_CPU_USAGE_PATH`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] nodece commented on a diff in pull request #16832: [feature][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by GitBox <gi...@apache.org>.
nodece commented on code in PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#discussion_r945396731


##########
pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/LinuxInfoUtils.java:
##########
@@ -44,13 +46,39 @@ public class LinuxInfoUtils {
     private static final String CGROUPS_CPU_USAGE_PATH = "/sys/fs/cgroup/cpu/cpuacct.usage";
     private static final String CGROUPS_CPU_LIMIT_QUOTA_PATH = "/sys/fs/cgroup/cpu/cpu.cfs_quota_us";
     private static final String CGROUPS_CPU_LIMIT_PERIOD_PATH = "/sys/fs/cgroup/cpu/cpu.cfs_period_us";
+
     // proc states
     private static final String PROC_STAT_PATH = "/proc/stat";
     private static final String NIC_PATH = "/sys/class/net/";
     // NIC type
     private static final int ARPHRD_ETHER = 1;
     private static final String NIC_SPEED_TEMPLATE = "/sys/class/net/%s/speed";
 
+    @VisibleForTesting
+    public static Object /*jdk.internal.platform.Metrics*/ metrics;

Review Comment:
   ```suggestion
       private static Object /*jdk.internal.platform.Metrics*/ metrics;
       @VisibleForTesting
       public Object getMetrics() { return metrics; };
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] heesung-sn commented on a diff in pull request #16832: [feature][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by "heesung-sn (via GitHub)" <gi...@apache.org>.
heesung-sn commented on code in PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#discussion_r1178536739


##########
pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/LinuxInfoUtils.java:
##########
@@ -98,11 +138,14 @@ public static double getTotalCpuLimit(boolean isCGroupsEnabled) {
      * Get CGroup cpu usage.
      * @return Cpu usage
      */
-    public static double getCpuUsageForCGroup() {
+    public static long getCpuUsageForCGroup() {
         try {
+            if (metrics != null && getCpuUsageMethod != null) {
+                return (long) getCpuUsageMethod.invoke(metrics);

Review Comment:
   For backward compatibility, don't we need to multiply the limit in calculateBrokerHostUsage?
   ```
   public void calculateBrokerHostUsage() {
   ... 
          double totalCpuLimit = getTotalCpuLimit(isCGroupsEnabled);
           if (isCGroupsEnabled && metrics != null && getCpuUsageMethod != null) {
               // cgroup cpuUsage is already scaled, [0.0, 1.0]
               usage.setCpu(new ResourceUsage(cpuUsage * totalCpuLimit, totalCpuLimit));
           } else {
               usage.setCpu(new ResourceUsage(cpuUsage, totalCpuLimit));
           }
           
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] heesung-sn commented on a diff in pull request #16832: [improve][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by "heesung-sn (via GitHub)" <gi...@apache.org>.
heesung-sn commented on code in PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#discussion_r1180611101


##########
pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/LinuxInfoUtils.java:
##########
@@ -98,11 +138,14 @@ public static double getTotalCpuLimit(boolean isCGroupsEnabled) {
      * Get CGroup cpu usage.
      * @return Cpu usage
      */
-    public static double getCpuUsageForCGroup() {
+    public static long getCpuUsageForCGroup() {
         try {
+            if (metrics != null && getCpuUsageMethod != null) {
+                return (long) getCpuUsageMethod.invoke(metrics);

Review Comment:
   I see. Thanks for the check.
   
   Ref:
   https://github.com/openjdk/jdk17/blob/master/src/java.base/share/classes/jdk/internal/platform/Metrics.java



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] nodece commented on a diff in pull request #16832: [feature][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by GitBox <gi...@apache.org>.
nodece commented on code in PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#discussion_r943064682


##########
pulsar-broker/src/test/java/org/apache/pulsar/broker/loadbalance/SimpleBrokerStartTest.java:
##########
@@ -97,4 +101,31 @@ public void testNoNICSpeed() throws Exception {
     }
 
 
+    @Test
+    public void testCGroupMetrics() throws IllegalAccessException {
+        if (!LinuxInfoUtils.isLinux()) {
+            return;
+        }
+
+        boolean existsCGroup = Files.exists(Paths.get("/sys/fs/cgroup"));
+        boolean cGroupEnabled = LinuxInfoUtils.isCGroupEnabled();
+        Assert.assertEquals(cGroupEnabled, existsCGroup);
+
+        double totalCpuLimit = LinuxInfoUtils.getTotalCpuLimit(cGroupEnabled);
+        log.info("totalCpuLimit: {}", totalCpuLimit);
+        Assert.assertTrue(totalCpuLimit > 0.0);
+
+        if (cGroupEnabled) {
+            Assert.assertNotNull(FieldUtils.readStaticField(LinuxInfoUtils.class, "metrics", true));

Review Comment:
   I think you can add a getter method with `@VisibleForTesting` for these fields to avoid reflect. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] nodece commented on a diff in pull request #16832: [feature][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by GitBox <gi...@apache.org>.
nodece commented on code in PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#discussion_r943183788


##########
pulsar-broker/src/test/java/org/apache/pulsar/broker/loadbalance/SimpleBrokerStartTest.java:
##########
@@ -97,4 +101,31 @@ public void testNoNICSpeed() throws Exception {
     }
 
 
+    @Test
+    public void testCGroupMetrics() throws IllegalAccessException {
+        if (!LinuxInfoUtils.isLinux()) {
+            return;
+        }
+
+        boolean existsCGroup = Files.exists(Paths.get("/sys/fs/cgroup"));
+        boolean cGroupEnabled = LinuxInfoUtils.isCGroupEnabled();
+        Assert.assertEquals(cGroupEnabled, existsCGroup);
+
+        double totalCpuLimit = LinuxInfoUtils.getTotalCpuLimit(cGroupEnabled);
+        log.info("totalCpuLimit: {}", totalCpuLimit);
+        Assert.assertTrue(totalCpuLimit > 0.0);
+
+        if (cGroupEnabled) {
+            Assert.assertNotNull(FieldUtils.readStaticField(LinuxInfoUtils.class, "metrics", true));

Review Comment:
   This test may be broken when someone modifies these fields.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] coderzc commented on a diff in pull request #16832: [feature][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by GitBox <gi...@apache.org>.
coderzc commented on code in PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#discussion_r945395452


##########
pulsar-broker/src/test/java/org/apache/pulsar/broker/loadbalance/SimpleBrokerStartTest.java:
##########
@@ -97,4 +101,31 @@ public void testNoNICSpeed() throws Exception {
     }
 
 
+    @Test
+    public void testCGroupMetrics() throws IllegalAccessException {
+        if (!LinuxInfoUtils.isLinux()) {
+            return;
+        }
+
+        boolean existsCGroup = Files.exists(Paths.get("/sys/fs/cgroup"));
+        boolean cGroupEnabled = LinuxInfoUtils.isCGroupEnabled();
+        Assert.assertEquals(cGroupEnabled, existsCGroup);
+
+        double totalCpuLimit = LinuxInfoUtils.getTotalCpuLimit(cGroupEnabled);
+        log.info("totalCpuLimit: {}", totalCpuLimit);
+        Assert.assertTrue(totalCpuLimit > 0.0);
+
+        if (cGroupEnabled) {
+            Assert.assertNotNull(FieldUtils.readStaticField(LinuxInfoUtils.class, "metrics", true));

Review Comment:
   Ok, it makes sense.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] heesung-sn commented on a diff in pull request #16832: [improve][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by "heesung-sn (via GitHub)" <gi...@apache.org>.
heesung-sn commented on code in PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#discussion_r1180611101


##########
pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/LinuxInfoUtils.java:
##########
@@ -98,11 +138,14 @@ public static double getTotalCpuLimit(boolean isCGroupsEnabled) {
      * Get CGroup cpu usage.
      * @return Cpu usage
      */
-    public static double getCpuUsageForCGroup() {
+    public static long getCpuUsageForCGroup() {
         try {
+            if (metrics != null && getCpuUsageMethod != null) {
+                return (long) getCpuUsageMethod.invoke(metrics);

Review Comment:
   I see. Thanks for the check.
   
   Ref:
   https://github.com/openjdk/jdk17/blob/master/src/java.base/share/classes/jdk/internal/platform/Metrics.java
   
   
   https://github.com/openjdk/jdk17/tree/master/src/java.base/linux/classes/jdk/internal/platform



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] coderzc commented on a diff in pull request #16832: [feature][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Posted by "coderzc (via GitHub)" <gi...@apache.org>.
coderzc commented on code in PR #16832:
URL: https://github.com/apache/pulsar/pull/16832#discussion_r1180389710


##########
pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/LinuxInfoUtils.java:
##########
@@ -98,11 +138,14 @@ public static double getTotalCpuLimit(boolean isCGroupsEnabled) {
      * Get CGroup cpu usage.
      * @return Cpu usage
      */
-    public static double getCpuUsageForCGroup() {
+    public static long getCpuUsageForCGroup() {
         try {
+            if (metrics != null && getCpuUsageMethod != null) {
+                return (long) getCpuUsageMethod.invoke(metrics);

Review Comment:
   The `Metrics.getCpuUsage` will return the aggregate time, so I think we don't need to multiply the limit.
   
   ```
       /**
        * Returns the aggregate time, in nanoseconds, consumed by all
        * tasks in the Isolation Group.
        *
        * @return Time in nanoseconds, -1 if unknown or
        *         -2 if the metric is not supported.
        *
        */
       public long getCpuUsage();
   ```



##########
pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/LinuxInfoUtils.java:
##########
@@ -98,11 +138,14 @@ public static double getTotalCpuLimit(boolean isCGroupsEnabled) {
      * Get CGroup cpu usage.
      * @return Cpu usage
      */
-    public static double getCpuUsageForCGroup() {
+    public static long getCpuUsageForCGroup() {
         try {
+            if (metrics != null && getCpuUsageMethod != null) {
+                return (long) getCpuUsageMethod.invoke(metrics);

Review Comment:
   The `Metrics.getCpuUsage` will return the aggregate time, so I think we don't need to multiply the limit.
   
   ```java
       /**
        * Returns the aggregate time, in nanoseconds, consumed by all
        * tasks in the Isolation Group.
        *
        * @return Time in nanoseconds, -1 if unknown or
        *         -2 if the metric is not supported.
        *
        */
       public long getCpuUsage();
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org